Computational Science Cookbook I: Lord of the Files
[ code , comp-sci ]

One of the most difficult challenges faced by a computational physicist is designing software in a user-friendly way. Most masters (or PhD, for that matter) students will present any code they have written with a profuse set of apologies for the mess the recipient is inheriting. This is a problem for several scientific reasons -

  1. Good science should be independently verifiable, so there is a moral obligation to make code accessible for others to read, understand and run to check your work.
  2. Good code is easier to modify - investing a few weeks into setting the problem up robustly can translate to massive time savings later, particularly if someone else ends up maintaining your code.
  3. Good practices make you more confident that the result you have is correct: Making code modular allows you to test different subsections independently, rooting out bugs before they cause weird behaviour somewhere else.

Living up to these lofty ideals is not so easy.

  1. By definition, you do not know what the final form of research code will look like when you start writing it.
  2. Scientists usually work individually, meaning that there are no ongoing checks protecting against bad practices or rewarding good ones.
  3. Writing good code is harder than writing bad code.
  4. Every research problem is unique, making it hard to modularise your code.
  5. It requires substantial experience to know which paradigms from the wider world of programming are helpful.

This post is the beginning of a series, in which I want to present my own perspective on how to tackle the non-physics parts of computational physics - which language to choose, how to structure folders, and how to structure your project in such a way that it is easily comprehensible to yourself and others. A tentative plan is the following:

The Toy Models

For concreteness, I will consider two different physics problems to illustrate the different approaches that can be used -

Model A: A deterministic Schrödinger equation solver. ‘Exactly’ diagonalises the Hamiltonian for n interacting particles in an arbitrary 1D potential well V(x).

Inputs: the number of particles n, the number of lattice sites, the potential V(x) sampled on those sites, and the interaction strength.

Outputs: the low-lying energies and wavefunctions, and the position- and momentum-space correlators.

Model B: Spin-Wave Theory. Find the classical ground state of a lattice, expand around the ground state in Holstein-Primakoff bosons, truncate to next-to-leading order and diagonalise the Hamiltonian. Then predict the spin correlators seen by neutron scattering.

Inputs: a lattice definition (e.g. lattice/honeycomb.toml) and the spin Hamiltonian parameters (exchange couplings etc.).

Outputs: the spin-wave dispersion and the corresponding spin correlators / neutron-scattering intensities, including powder averages.

The Zen of Physics Code

Before I met Jeanine, my life was cosmically a shambles, it was ah… I was using bits and pieces of whatever Eastern philosophies happened to drift through my transom and she sort of sorted it out for me, straightened it out for me, gave me a path, you know, a path to follow.

- David St. Hubbins, This is Spinal Tap

  1. Make it obvious where data comes from.
  2. Make it obvious what programs do.
  3. Programs should do one thing and do it well.
  4. At most three shell commands should be involved in transforming parameters into a publishable graph.
  5. Label everything.

The general idea here is to write executables and datafiles that can be read as verbs and nouns that describe the whole calculation:

[ Find the two-body Hamiltonian ] of ( n electrons ) and [ diagonalise it ]
[ Find the classical ground state ] of ( spins on a lattice ), 
[ expand to second order in Holstein-Primakoff bosons ] and
[ find the corresponding spin correlators ], 
then [ average over them with Monte Carlo ]

Each item in [ brackets ] can be interpreted as a verb, and arguably deserves its own executable, and the item in ( parentheses ) should have an input file associated with it. Obviously there’s a judgement call to make here when you decide how to modularise everything - if the code is fast, it can be more appropriate to have a ‘god verb’ that does everything. These guidelines are intended for a fairly specific use case, a complicated, multi-step calculation that technically only needs to run once.

Less abstractly, these maxims translate to

  1. Every datafile in your output should have the input that made it and the context to interpret it saved with it. If you are producing a 2D ‘heatmap’ output, you should save the 1D X axis and Y axis information in the same place that you saved the Z axis matrix.
  2. Write your ‘verbs’ in such a way that they can be used without reading the source code. Instead of crashing on an incorrect call sequence, print a message gently reminding the user how it’s meant to be called (see the sketch after this list). Use at most 4 command line parameters: any more, and they should go in a file.
  3. Your code does not have to fit into one executable. If your simulation involves precalculating some parameters then running a Monte Carlo simulation, you should have two executables calc_params and run_sim, where the latter is fed output from the former. The intermediate data should be saved to a file (with metadata), as per [1].
  4. Your project should have at most three levers - “run simulation”, “process data” and “plot data”. (A good way to achieve this is with shell or python scripts).
  5. Magic numbers, whether in the source code or in the in/out files, are not allowed.
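To make maxim [2] concrete, here is a minimal Python sketch of such a ‘verb’ - the script name and the two expected arguments are hypothetical, but the shape is the same whether the verb is a script or a compiled binary.

#!/usr/bin/env python3
# simwavfunc.py - a hypothetical 'verb' that simulates one input file
import sys

def main():
    # maxim [2]: explain the expected call sequence instead of crashing on a bad one
    if len(sys.argv) != 3:
        sys.exit(f"usage: {sys.argv[0]} <input file (.inp)> <output directory>")
    input_file, output_dir = sys.argv[1], sys.argv[2]
    # ... read input_file, run the calculation, write results (plus metadata) to output_dir ...

if __name__ == "__main__":
    main()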

Directory structure

This is more art than science, but if you’re writing your first big project it can seem an insurmountable obstacle to even establish where to start.

The thing to have in mind is this: For science code, the main tasks you’ll be doing are

  1. Writing the simulation
  2. Creating new simulations with slightly different parameters
  3. Plotting the output data in new, interesting ways

The goal now is to minimise the hassle that these tasks cause.

Example: Model A

mdA + .gitignore
    + Makefile
    + README.md
    + _driver ----- + simulate.sh 
    |               + process.sh
    + _process ---- + plot_wavefunc.py
    |               + plot_momentum.py
    |               + correlate.ipynb
    |               + img --------- + wavefun_onepart.pdf
    |                               + wavefun_twopart.pdf
    + bin --------- + simwavfunc
    |               + simspinfunc
    |               + spinwavfunc_field
    + src --------- + sim.h
    |               + sim.c
    |               + ...
    + input -----   + one_part.inp
    |               + two_part.inp
    |               + potentials -- + harmonic.csv
    |                               + quartic.csv
    |                               + free.csv
    + data   ------ + one_part ---- + _sim_params.inp
                    |               + _potential.csv
                    |               + rcorr.csv
                    |               + pcorr.csv
                    |               + wavefunc.csv
                    |               + energy.csv
                    + two_part ---- + _sim_params.inp
                                    + _potential.csv
                                    + rcorr.csv    
                                    + pcorr.csv
                                    + wavefunc.csv
                                    + energy.csv

I’m using underscores to push the most important things to the top of ls output. Creating and running a simulation might then look like

cd mdA
cp input/two_part.inp input/three_part.inp
vi input/three_part.inp # edit the input parameters...
bin/simwavfunc input/three_part.inp data/three_part
# wait until finished...
python3 _process/plot_wavefunc.py data/three_part

The bash scripts simulate.sh and process.sh are the last things to be added before publishing. They contain, respectively:

#!/bin/bash
# simulate.sh
mkdir -p data

for sim in input/*.inp; do
    # strip the path and extension to get the folder name
    s=$(basename "${sim%.*}")
    mkdir -p "data/$s"
    bin/simwavfunc "$sim" "data/$s"
done

#!/bin/bash
# process.sh
mkdir -p _process/img
for dir in data/*; do
    python3 _process/plot_wavefunc.py "$dir"
done

These are the levers of maxim [4]: simulate.sh runs the simulations, and process.sh handles processing and plotting.

Note _sim_params.inp in the output folders. These are copies of one_part.inp and two_part.inp, which are produced by the simulation executable itself. This is helpful for many reasons:

  1. Every dataset carries the exact input that generated it, satisfying maxim [1].
  2. Any run can be reproduced or extended later without hunting down the original input file.
  3. The data directory is self-contained, so it can be archived or handed to someone else as-is.

Some extra benefits of this structure are:

  1. Adding a new simulation is just a matter of dropping a new .inp file into input/ - simulate.sh and process.sh pick it up automatically.
  2. The bulky data/ directory can be kept out of version control with .gitignore while src/ and input/ stay tracked.
  3. Everything has one predictable home: source in src/, inputs in input/, raw results in data/, figures in _process/img.

If necessary, judicious use of tar can compress the data files if you need to transfer them between different computers. (Cheat sheet: tar -czvf MyDir.tar.gz MyDir to compress, tar -xzvf MyDir.tar.gz to unzip. It’s unlikely that you would ever need to do anything else with tar.)

Example: Spin-Wave Theory

The requirements here are a bit different. One of our inputs is a 3D lattice, which has complex structure that can’t be represented in a simple CSV / .inp file. In this case, it’s more appropriate to use a “real” format (e.g. TOML) to store the complicated input data (see below for details).

mdB + .gitignore
    + Makefile
    + README.md
    + _process ---- + _fit_cote_0p05.py
    |               + _fit_cote_1p70.py
    |               + fitting.ipynb
    |               + correlate.ipynb
    |               + spinW.py
    + fig --------- + powder_honeycomb_realistic.pdf
    |               + dispersion_honeycomb_realistic.pdf
    |               + ...
    + bin --------- + spinwave
    |               + paverage
    + src --------- + sim.h
    |               + sim.c
    |               + ...
    + experimental  + CoTe_sqw_0p05K.csv
    |               + CoTe_sqw_1p70K.csv
    |               + ...
    + lattice ----- + honeycomb.toml
                    + cubic.toml
                    + pyrochlore.toml

This one’s a bit different. The use case here is fitting experimental data with the spin-wave toolchain - the .ipynb notebooks are for interactive experimentation, but the final results are produced by the python scripts _fit_cote_0p05.py and _fit_cote_1p70.py. The compiled binaries are called through the wrapper library spinW.py.

The Input File

It’s really tempting to write input files quick and dirty. For example:

# input.in
# first run, single particle test in a harmonic trap
1 # number of particles
11 # number of sites
25 16 9 4 1 0 1 4 9 16 25 # potential (must have n = number of sites)
1e-4 # interaction strength

or even more mystically

1
11
25 16 9 4 1 0 1 4 9 16 25
1e-4

These two structures are “simple” to parse in a C-like language -


#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>

#define MAX_LINE_LEN 1024

void parse_file(
    const char* fname,
    uint16_t* nparts,
    uint16_t* nsites,
    double** V,
    double* istrength
    ) {

    FILE* fp = fopen(fname, "r");
    char linebuf[MAX_LINE_LEN];
    // read first line
    fgets(linebuf, MAX_LINE_LEN, fp);
    *nparts = atoi(linebuf);
    // read second line
    fgets(linebuf, MAX_LINE_LEN, fp);
    *nsites = atoi(linebuf);

    *V = (double *) malloc(sizeof(double)* (*nsites));
    // read third line V into memory
    fgets(linebuf, MAX_LINE_LEN, fp);
    // assume space delimiters
    char *token = strtok(linebuf, " ");

    uint16_t idx=0;
    while( token != NULL ) {
        (*V)[idx] = atof(token);
        token = strtok(NULL, " ");
        idx++;
        if (idx >= *nsites) break; // make sure we don't overrun!
    }

    // read fourth line
    fgets(linebuf, MAX_LINE_LEN, fp);
    *istrength = atof(linebuf);
    fclose(fp);
}

From a scientific perspective, the code is very dangerous. Suppose you miscount and write

1
15
25 16 9 4 1 0 1 4 9 16 25
1e-4

Then the program will silently allocate space for 15 values but only fill 11 of them - the rest will be zero (if the OS is being polite) or, even worse, whatever garbage was lying around in RAM before your program got hold of it!

If we get the line order wrong, like this,

1
1e-4
25 16 9 4 1 0 1 4 9 16 25
11

It’s not obvious what atoi will make of these lines, but it definitely isn’t the right thing, and again there will be no warning.

#define MAX_LINE_LEN 1024 might also become a problem if we have a somewhat large potential - that’s only enough to fit ~40 plaintext doubles with full precision.

None of these issues is unfixable; it’s just a massive time-sink to sort out this kind of data validation by hand. I propose a few alternatives below.

The worst option for handling input is using unlabeled command line arguments.

./sim 1 11 25 16 9 4 1 0 1 4 9 16 25 1e-4

This is completely opaque, and is a flagrant violation of [1] and [2]. In my opinion, positional command line arguments should never be used to specify physical parameters - it’s just too easy to forget which place means what.

Rolling your own parser code

This can be defensible if you only have a small number of low-complexity inputs, and there’s something to be said for avoiding dependencies on libraries. Some guidelines for doing this though:

  1. Label every parameter by name (key = value), so that the order of lines in the file doesn’t matter.
  2. Fail loudly, with a helpful message, on anything you don’t recognise, rather than silently ploughing on.
  3. Make it easy to write the parsed parameters back out again, so that each run can save a copy of its input next to its data (maxim [1]).

My own homebrew parser is on GitHub if you’re interested. Its interface works like this:

#include <cstdio>
#include <cstdlib>
#include <string>

#include <parameter.hh>

// Use template magic to declare what types our numeric parameters are
typedef parameters<int, unsigned, double> param_t;

int main (int argc, char** argv) {
    // ---------- Reading parameters ----------
    // Expect an input file and an output directory on the command line
    if (argc < 3){
        fprintf(stderr, "usage: %s <input file> <output directory>\n", argv[0]);
        return EXIT_FAILURE;
    }

    int n;
    double T;
    
    // Optional parameter variables
    unsigned
        sample = 2048,
        sweep  = 4,
        burnin = 128,        
        n_time = 512, 
        n_freq = 128; 
    double
        delta = 0.0625;
    

    param_t p;
    p.declare("system_size",n);
    p.declare("temperature",T);
    p.declare("n_sample", sample);
    p.declare("sweep", sweep);
    p.declare("n_burnin", burnin);
    p.declare("n_timesteps", n_time);
    p.declare("n_Egrid", n_freq);
    p.declare("dt", delta);

    double gphase=0;
    p.declare("theta", gphase);

    //read the params
    p.from_file(argv[1]);


    // ... simulation code uses sample, sweep, burnin....


    std::string param_copy_path(argv[2]);
    param_copy_path.append("/params.inp");
    p.save_file(param_copy_path);
    return 0;
}

Pros

  1. No external dependency - the whole parser lives in your repository and compiles wherever your code does.
  2. Parameters are labelled by name, so the input file is self-documenting and robust against reordering.
  3. It is easy to add exactly the features you need, such as writing the parameters back out next to the data (save_file above).

Cons

  1. You have to write, test and maintain it yourself - a bug in the parser becomes a bug in your science.
  2. It only really handles flat key-value data; anything with more structure (lattices, nested options) quickly outgrows it.

JSON / YAML / TOML

Babies of Web 2.0, these standards are widely supported and excel at handling heterogeneous data. The syntax for YAML and TOML is a bit cleaner than JSON, but the three are otherwise essentially interchangeable. For a Rosetta Stone of sorts, see this deep dive.
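As a concrete sketch, Python 3.11+ can read TOML with the standard-library module tomllib, so a file like lattice/honeycomb.toml could be consumed along these lines. The keys used here (name, sites, label, position) are purely illustrative, not a prescribed schema.

import tomllib  # standard library from Python 3.11 onwards

# tomllib requires the file to be opened in binary mode
with open("lattice/honeycomb.toml", "rb") as f:
    lattice = tomllib.load(f)

print(lattice["name"])                      # e.g. "honeycomb" (illustrative key)
for site in lattice["sites"]:               # hypothetical [[sites]] array of tables
    print(site["label"], site["position"])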

Pros

  1. Mature parsers exist for essentially every language, so you never have to write or debug the parsing yourself.
  2. They cope with nested, heterogeneous data - lists, tables, strings and numbers can all live in one file.
  3. The files remain human-readable and easy to edit by hand.

Cons

  1. In C, C++ or Fortran they pull in an external library dependency - exactly what the homebrew approach avoids.
  2. It is tempting to over-engineer the input schema just because the format allows it.
  3. YAML in particular has some well-known parsing surprises (significant whitespace, implicit type conversion).

The Output File

Simulation output is unpredictable, and so it is generally speaking a bad idea to use a custom file format for it. Whichever format you choose, every output should record:

  1. What input parameters produced this output
  2. A timestamp
  3. What version of the code was running
  4. (optional) How the output can be visualised

[2] and [3] are often overlooked, but they’re extremely important if you make changes to your code. (And if you’re doing anything nontrivial, you will be making changes to your code!)

If you go with the simple approach, i.e. managing metadata by copying the input file verbatim to the output directory (with some clear association between it and the output data, such as a common filename root), you can just prepend this extra data as a comment:

# input_copy.cff
# run 2022-04-10 17:33
# gs_find v1.14 'new Hermite solver'
system_size = 12
n_sample = 128
n_timesteps = 512
n_Egrid = 128
dt = 0.025
theta = 0.001

Just increment the version number every time you change a behaviour. The same goes for TOML, JSON or any other file format.
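A minimal sketch of how the executable (or a thin driver around it) might produce such a copy - the VERSION string and paths here are placeholders:

from datetime import datetime

VERSION = "gs_find v1.14 'new Hermite solver'"   # bump this whenever behaviour changes

def copy_input_with_metadata(input_path, output_dir):
    """Copy the input file next to the data, prepending a timestamp and code version."""
    stamp = datetime.now().strftime("%Y-%m-%d %H:%M")
    with open(input_path) as src:
        body = src.read()
    with open(f"{output_dir}/params.inp", "w") as dst:
        dst.write(f"# run {stamp}\n# {VERSION}\n{body}")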

CSV

It is generally desirable to use common, human-readable data formats. .csv is the obvious choice - it’s trivial to implement basic CSV parsers in C++ (and not much harder in Fortran or pure C), and most linear algebra libraries (e.g. Eigen, Armadillo) have built-in functions for saving to CSV.

This comes at a price - an ASCII representation of a double is not generally an exact mapping (though in practice, any error introduced here will be swamped by more mundane sources such as roundoff), and it is woefully space-inefficient: the full precision of a double renders into something like -1.0123456789012345e+101, which is 24 bytes to store 8 bytes of data.

Another major downside is that your data has to be massaged into a 2D tabular format, severely restricting the physical applications this is suitable for.

For Model A, one could structure a file containing the observable $\langle \mathbf{r}_1 + \mathbf{r}_2 \rangle$ like this:

# 2022-04-02 14:44 GMT V=parabola.csv n=2 n_states=11
# Energy | <psi|x1 + x2|psi>
 -11.11229 | 0.298329 ...
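Reading such a file back is then a one-liner in most environments; a sketch with NumPy (the path is illustrative):

import numpy as np

# '#' lines are treated as comments, columns are separated by '|';
# ndmin=2 keeps a 2D shape even if there is only one data row
data = np.loadtxt("data/two_part/observables.csv", comments="#", delimiter="|", ndmin=2)
energy, x_expectation = data[:, 0], data[:, 1]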

Pros:

  1. Human-readable: you can sanity-check results with less, a spreadsheet or any plotting tool.
  2. Trivial to read and write from any language, with no library dependencies.
  3. Comment lines give you a natural home for the metadata of maxim [1] (as in the example above).

Cons:

  1. Roughly three times larger than the equivalent binary, and the text round-trip is not bit-exact.
  2. Everything has to be forced into a 2D table; higher-dimensional or heterogeneous output simply does not fit.
  3. Reading and writing large files is slow compared to binary formats.

Raw binary files

The C standard library function fwrite lets you just dump whatever complicated object is in RAM onto the disk and call it a day.

Pros:

  1. Fast, compact and exact - no formatting, no parsing, no precision loss.
  2. Nothing to implement beyond a matching fwrite/fread pair.

Cons

  1. The file is meaningless without the exact layout (and often the exact compiler and architecture) that wrote it: endianness, struct padding and type sizes are all baked in silently.
  2. There is nowhere natural to put metadata, which makes it a flagrant violation of ‘label everything’.
  3. You cannot inspect it by eye when something looks wrong.
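To make the main con concrete: reading a raw dump back requires knowing the element type and array shape from somewhere else entirely. A NumPy sketch (the file name and shape are hypothetical):

import numpy as np

# nothing in the file itself records that it holds 128x128 little-endian float64 values;
# get the dtype or the shape wrong and you silently read garbage
psi = np.fromfile("data/two_part/wavefunc.bin", dtype="<f8").reshape(128, 128)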

NumPy data format

There are libraries for storing array data as .npy, the same format that NumPy saves its arrays in with np.save('data.npy', a).

Pros

  1. Binary and compact, and the header stores the array’s dtype and shape - no out-of-band knowledge needed to read it back.
  2. Loads directly into analysis and plotting scripts with np.load.

Cons

  1. One array per file (several arrays need .npz or separate files), so richer metadata still has to live elsewhere.
  2. Writing it from C, C++ or Fortran means pulling in (or hand-rolling) a small extra library.
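For completeness, the Python side of the round trip is about as simple as it gets (paths illustrative):

import numpy as np

a = np.linspace(0.0, 1.0, 11)        # some array produced by the simulation or analysis
np.save("data/example.npy", a)       # header stores dtype and shape alongside the raw bytes
b = np.load("data/example.npy")      # comes back with the correct dtype and shape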

HDF5

If your output is complicated, you may want to consider using HDF5, which stands for Hierarchical Data Format. This is the native format used by the Julia library JLD, which plays the role of NumPy’s far more limited .npy format.

A real use case of this is the following snippet, from my spinon GMFT code

using JLD
using HDF5

# ... other code ...

# physical constants associated with the simulation. 
struct SimulationParameters
    A::Matrix{Float64}
    Jpm::Float64
    B::Vec3_F64
    λ::Float64
    lat::geom.PyroFCC
    name::String
end

# Represents a path through the Brillouin zone.
struct BZPath
    t::Vector{Float64} # the "time" parameter, units of reciprocal length
    K::Vector{Vector{Float64}} # a list of points in K-space.
    ticks_t::Vector{Float64} # The "times" corresponding to high-symmetry points
    ticks_label::Vector{String} # The labels for the high symmetry points
end

# A 'hash function' providing an informative filename for simulation data
function sim_identifier(sim::SimulationParameters)
    return @sprintf("?name=%s?J_pm=%.3f?B=[%.3f,%.3f,%.3f]",
        sim.name,sim.Jpm,sim.B[1],sim.B[2],sim.B[3]) 
end

function save_spinons(output_dir::String;
        bands::Matrix{Float64},
        BZ_path::BZPath,
        sim::SimulationParameters, 
        prefix="spinons")

    name = output_dir*"/"*prefix*sim_identifier(sim)*".jld"
    jldopen(name, "w") do file
        
        # stores a copy of the input
        g = create_group(file, "physical_parameters")
        g["name"] = sim.name
        g["fluxes"] = sim.A
        g["Jpm"] = sim.Jpm
        g["B"] = sim.B
        g["lambda"] = sim.λ
        g["L"]=sim.lat.L

        # stores the output 
        d = create_group(file, "spinon_dispersion") 
        d["bands"] = bands
        # a list of K points, such that the i'th slice corresponds to the i'th K point
        d["Q_list"] = BZ_path.K 
        d["tau"] = BZ_path.t
        d["ticks_tau"] = BZ_path.ticks_t
        d["ticks_label"] = BZ_path.ticks_label
    end
    return name
end

A “group” in an HDF5 file is essentially a folder, allowing for neat organisation of complex data.
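Because a .jld file is ordinary HDF5 underneath, the groups and datasets written above can also be opened from Python with h5py - a sketch, with an illustrative file name (JLD’s own bookkeeping entries are simply ignored, and Julia’s column-major arrays may come back transposed):

import h5py

with h5py.File("spinons.jld", "r") as f:        # illustrative file name
    Jpm   = f["physical_parameters/Jpm"][()]    # the input coupling, saved with the data
    bands = f["spinon_dispersion/bands"][()]    # the output band energies (possibly transposed)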

Pros

  1. Hierarchical groups let inputs, outputs and metadata live together in a single self-describing file (maxims [1] and [5]).
  2. Binary and compact, and it handles arbitrarily shaped arrays alongside scalars and strings.
  3. Mature bindings exist for C, C++, Fortran, Python (h5py) and Julia, so one file can serve the whole toolchain.
  4. Supports parallel access for large simulations.

The Bottom Line

As of 2024, my preferred general-purpose solution is to use something human-readable (e.g. TOML) for input data, and HDF5 for output. Parallel access in HDF5 is a powerful extra feature for large simulations. That being said, there is always a trade-off between complexity and readability - if you are not an experienced programmer, some of these library interfaces can appear to be more trouble than they’re worth.