Computational Science Cookbook I: Lord of the Files
[ code , comp-sci ]

One of the most difficult challenges faced by a computational physicist is designing software in a user-friendly way. Most masters (or PhD, for that matter) students will present any code they have written with a profuse set of apologies for the mess the recipient is inheriting. This is a problem for several scientific reasons -

  1. Good science should be independently verifiable, so there is a moral obligation to make code accessible for others to read, understand and run to check your work.
  2. Good code is easier to modify - investing a few weeks into setting the problem up robustly can translate to massive time savings later, particularly if someone else ends up maintaining your code.
  3. Good practices make you more confident that the result you have is correct: Making code modular allows you to test different subsections independently, rooting out bugs before they cause weird behaviour somewhere else.

Living up to these lofty ideals is not so easy.

  1. By definition, you do not know what the final form of research code will look like when you start writing it.
  2. Scientists usually work individually, meaning that there are no ongoing checks protecting against bad practices or rewarding good ones.
  3. Writing good code is harder than writing bad code.
  4. Every research problem is unique, making it hard to modularise your code.
  5. It requires substantial experience to know which paradigms from the wider world of programming are helpful.

This post is the beginning of a series, in which I want to present my own perspective on how to tackle the non-physics parts of computational physics - which language to choose, how to structure folders, and how to structure your project in such a way that it is easily comprehensible to yourself and others. A tentative plan is the following:

The Toy Models

For concreteness, I will consider two different physics problems to illustrate the different approaches that can be used -

Model A: A deterministic Schrödinger equation solver. ‘Exactly’ diagonalises the Hamiltonian for n interacting particles in an arbitrary 1D potential well V(x).

Inputs: the number of particles n, the number of lattice sites, the potential V(x) sampled on those sites, and the interaction strength.

Outputs: the low-lying energies and wavefunctions, and the position- and momentum-space correlators.

Model B: Spin-Wave Theory. Find the classical ground state of a lattice, expand around the ground state in Holstein-Primakoff bosons, truncate to next-to-leading order and diagonalise the Hamiltonian. Then predict the spin correlators seen by neutron scattering.

Inputs: a lattice definition (e.g. lattice/honeycomb.toml) and the spin Hamiltonian parameters (exchange couplings etc.).

Outputs: the spin-wave dispersion and the corresponding spin correlators / neutron-scattering intensities, including powder averages.

The Zen of Physics Code

Before I met Jeanine, my life was cosmically a shambles, it was ah… I was using bits and pieces of whatever Eastern philosophies happened to drift through my transom and she sort of sorted it out for me, straightened it out for me, gave me a path, you know, a path to follow.

- David St. Hubbins, This is Spinal Tap

  1. Make it obvious where data comes from.
  2. Make it obvious what programs do.
  3. Programs should do one thing and do it well.
  4. At most three shell commands should be involved in transforming parameters into a publishable graph.
  5. Label everything.

The general idea here is to write executables and datafiles that can be read as verbs and nouns that describe the whole calculation:

[ Find the two-body Hamiltonian ] of ( n electrons ) and [ diagonalise it ]
[ Find the classical ground state ] of ( spins on a lattice ), 
[ expand to second order in Holstein-Primakoff bosons ] and
[ find the corresponding spin correlators ], 
then [ average over them with Monte Carlo ]

Each item in [ brackets ] can be interpreted as a verb, and arguably deserves its own executable, and the item in ( parentheses ) should have an input file associated with it. Obviously there’s a judgement call to make here when you decide how to modularise everything - if the code is fast, it can be more appropriate to have a ‘god verb’ that does everything. These guidelines are intended for a fairly specific use case, a complicated, multi-step calculation that technically only needs to run once.

Less abstractly, these maxims translate to

  1. Every datafile in your output should have the input that made it and the context to interpret it saved with it. If you are producing a 2D ‘heatmap’ output, you should save the 1D X axis and Y axis information in the same place that you saved the Z axis matrix.
  2. Write your ‘verbs’ in such a way that they can be used without reading the source code. Instead of crashing on an incorrect call sequence, print a message gently reminding the user how it’s meant to be called (see the sketch after this list). Use at most 4 command line parameters: any more, and they should go in a file.
  3. Your code does not have to fit into one executable. If your simulation involves precalculating some parameters then running a Monte Carlo simulation, you should have two executables calc_params and run_sim, where the latter is fed output from the former. The intermediate data should be saved to a file (with metadata), as per [1].
  4. Your project should have at most three levers - “run simulation”, “process data” and “plot data”. (A good way to achieve this is with shell or python scripts).
  5. Magic numbers, whether in the source code or in the in/out files, are not allowed.
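To make maxim [2] concrete, here is a minimal Python sketch of such a ‘verb’ - the script name and the two expected arguments are hypothetical, but the shape is the same whether the verb is a script or a compiled binary.

#!/usr/bin/env python3
# simwavfunc.py - a hypothetical 'verb' that simulates one input file
import sys

def main():
    # maxim [2]: explain the expected call sequence instead of crashing on a bad one
    if len(sys.argv) != 3:
        sys.exit(f"usage: {sys.argv[0]} <input file (.inp)> <output directory>")
    input_file, output_dir = sys.argv[1], sys.argv[2]
    # ... read input_file, run the calculation, write results (plus metadata) to output_dir ...

if __name__ == "__main__":
    main()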

Directory structure

This is more art than science, but if you’re writing your first big project it can seem an insurmountable obstacle to even establish where to start.

The thing to have in mind is this: For science code, the main tasks you’ll be doing are

  1. Writing the simulation
  2. Creating new simulations with slightly different parameters
  3. Plotting the output data in new, interesting ways

The goal now is to minimise the hassle that these tasks cause.

Example: Model A

mdA + .gitignore
    + Makefile
    + README.md
    + _driver ----- + simulate.sh 
    |               + process.sh
    + _process ---- + plot_wavefunc.py
    |               + plot_momentum.py
    |               + correlate.ipynb
    |               + img --------- + wavefun_onepart.pdf
    |                               + wavefun_twopart.pdf
    + bin --------- + simwavfunc
    |               + simspinfunc
    |               + spinwavfunc_field
    + src --------- + sim.h
    |               + sim.c
    |               + ...
    + input -----   + one_part.inp
    |               + two_part.inp
    |               + potentials -- + harmonic.csv
    |                               + quartic.csv
    |                               + free.csv
    + data   ------ + one_part ---- + _sim_params.inp
                    |               + _potential.csv
                    |               + rcorr.csv
                    |               + pcorr.csv
                    |               + wavefunc.csv
                    |               + energy.csv
                    + two_part ---- + _sim_params.inp
                                    + _potential.csv
                                    + rcorr.csv    
                                    + pcorr.csv
                                    + wavefunc.csv
                                    + energy.csv

I’m using underscores to push the most important things to the top of ls output. Creating and running a simulation might then look like

cd mdA
cp input/two_part.inp input/three_part.inp
vi input/three_part.inp # edit the input parameters...
bin/simwavfunc input/three_part.inp data/three_part
# wait until finished...
python3 _process/plot_wavefunc.py data/three_part

The bash scripts simulate.sh and process.sh are the last things to be added before publishing. They contain, respectively:

#!/bin/bash
# simulate.sh
mkdir -p data

for sim in input/*.inp; do
    # strip the path and extension to get the folder name
    s=$(basename "${sim%.*}")
    mkdir -p "data/$s"
    bin/simwavfunc "$sim" "data/$s"
done

#!/bin/bash
# process.sh
mkdir -p _process/img
for dir in data/*; do
    python3 _process/plot_wavefunc.py "$dir"
done

These are the levers of maxim [4]: simulate.sh runs the simulations, and process.sh handles processing and plotting.

Note _sim_params.inp in the output folders. These are copies of one_part.inp and two_part.inp, which are produced by the simulation executable itself. This is helpful for many reasons:

  1. Every dataset carries the exact input that generated it, satisfying maxim [1].
  2. Any run can be reproduced or extended later without hunting down the original input file.
  3. The data directory is self-contained, so it can be archived or handed to someone else as-is.

Some extra benefits of this structure are:

  1. Adding a new simulation is just a matter of dropping a new .inp file into input/ - simulate.sh and process.sh pick it up automatically.
  2. The bulky data/ directory can be kept out of version control with .gitignore while src/ and input/ stay tracked.
  3. Everything has one predictable home: source in src/, inputs in input/, raw results in data/, figures in _process/img.

If necessary, judicious use of tar can compress the data files if you need to transfer them between different computers. (Cheat sheet: tar -czvf MyDir.tar.gz MyDir to compress, tar -xzvf MyDir.tar.gz to unzip. It’s unlikely that you would ever need to do anything else with tar.)

Example: Spin-Wave Theory

The requirements here are a bit different. One of our inputs is a 3D lattice, which has complex structure that can’t be represented in a simple CSV / .inp file. In this case, it’s more appropriate to use a “real” format (e.g. TOML) to store the complicated input data (see below for details).

mdB + .gitignore
    + Makefile
    + README.md
    + _process ---- + _fit_cote_0p05.py
    |               + _fit_cote_1p70.py
    |               + fitting.ipynb
    |               + correlate.ipynb
    |               + spinW.py
    + fig --------- + powder_honeycomb_realistic.pdf
    |               + dispersion_honeycomb_realistic.pdf
    |               + ...
    + bin --------- + spinwave
    |               + paverage
    + src --------- + sim.h
    |               + sim.c
    |               + ...
    + experimental  + CoTe_sqw_0p05K.csv
    |               + CoTe_sqw_1p70K.csv
    |               + ...
    + lattice ----- + honeycomb.toml
                    + cubic.toml
                    + pyrochlore.toml

This one’s a bit different. The use case here is fitting experimental data with the spin-wave toolchain - the .ipynb notebooks are for interactive experimentation, but the final results are produced by the python scripts _fit_cote_0p05.py and _fit_cote_1p70.py. The compiled binaries are called through the wrapper library spinW.py.

The Input File

It’s really tempting to write input files quick and dirty. For example:

# input.in
# first run, single particle test in a harmonic trap
1 # number of particles
11 # number of sites
25 16 9 4 1 0 1 4 9 16 25 # potential (must have n = number of sites)
1e-4 # interaction strength

or even more mystically

1
11
25 16 9 4 1 0 1 4 9 16 25
1e-4

These two structures are “simple” to parse in a C-like language -


#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>

#define MAX_LINE_LEN 1024

void parse_file(
    const char* fname,
    uint16_t* nparts,
    uint16_t* nsites,
    double** V,
    double* istrength
    ) {

    FILE* fp = fopen(fname, "r");
    char linebuf[MAX_LINE_LEN];
    // read first line
    fgets(linebuf, MAX_LINE_LEN, fp);
    *nparts = atoi(linebuf);
    // read second line
    fgets(linebuf, MAX_LINE_LEN, fp);
    *nsites = atoi(linebuf);

    *V = (double *) malloc(sizeof(double)* (*nsites));
    // read third line V into memory
    fgets(linebuf, MAX_LINE_LEN, fp);
    // assume space delimiters
    char *token = strtok(linebuf, " ");

    uint16_t idx=0;
    while( token != NULL ) {
        (*V)[idx] = atof(token);
        token = strtok(NULL, " ");
        idx++;
        if (idx >= *nsites) break; // make sure we don't overrun!
    }

    // read fourth line
    fgets(linebuf, MAX_LINE_LEN, fp);
    *istrength = atof(linebuf);
    fclose(fp);
}

From a scientific perspective, the code is very dangerous. Suppose you miscount and write

1
15
25 16 9 4 1 0 1 4 9 16 25
1e-4

Then the program will silently allocate space for 15 values but only fill 11 of them - the rest will be zero (if the OS is being polite) or, even worse, whatever garbage was lying around in RAM before your program got hold of it!

If we get the line order wrong, like this,

1
1e-4
25 16 9 4 1 0 1 4 9 16 25
11

It’s not obvious what atoi will make of these lines, but it definitely isn’t the right thing, and again there will be no warning.

#define MAX_LINE_LEN 1024 might also become a problem if we have a somewhat large potential - that’s only enough to fit ~40 plaintext doubles with full precision.

None of these issues is unfixable; it’s just a massive time-sink to sort out this kind of data validation by hand. I propose a few alternatives below.

The worst option for handling input is using unlabeled command line arguments.

./sim 1 11 25 16 9 4 1 0 1 4 9 16 25 1e-4

This is completely opaque, and is a flagrant violation of [1] and [2]. In my opinion, positional command line arguments should never be used to specify physical parameters - it’s just too easy to forget which place means what.

Rolling your own parser code

This can be defensible if you only have a small number of low-complexity inputs, and there’s something to be said for avoiding dependencies on libraries. Some guidelines for doing this though:

  1. Label every parameter by name (key = value), so that the order of lines in the file doesn’t matter.
  2. Fail loudly, with a helpful message, on anything you don’t recognise, rather than silently ploughing on.
  3. Make it easy to write the parsed parameters back out again, so that each run can save a copy of its input next to its data (maxim [1]).

My own homebrew parser is on GitHub if you’re interested. Its interface works like this:

#include <cstdio>
#include <cstdlib>
#include <string>

#include <parameter.hh>

// Use template magic to declare what types our numeric parameters are
typedef parameters<int, unsigned, double> param_t;

int main (int argc, char** argv) {
    // ---------- Reading parameters ----------
    // Expect an input file and an output directory on the command line
    if (argc < 3){
        fprintf(stderr, "usage: %s <input file> <output directory>\n", argv[0]);
        return EXIT_FAILURE;
    }

    int n;
    double T;
    
    // Optional parameter variables
    unsigned
        sample = 2048,
        sweep  = 4,
        burnin = 128,        
        n_time = 512, 
        n_freq = 128; 
    double
        delta = 0.0625;
    

    param_t p;
    p.declare("system_size",n);
    p.declare("temperature",T);
    p.declare("n_sample", sample);
    p.declare("sweep", sweep);
    p.declare("n_burnin", burnin);
    p.declare("n_timesteps", n_time);
    p.declare("n_Egrid", n_freq);
    p.declare("dt", delta);

    double gphase=0;
    p.declare("theta", gphase);

    //read the params
    p.from_file(argv[1]);


    // ... simulation code uses sample, sweep, burnin....


    std::string param_copy_path(argv[2]);
    param_copy_path.append("/params.inp");
    p.save_file(param_copy_path);
    return 0;
}

Pros

  1. No external dependency - the whole parser lives in your repository and compiles wherever your code does.
  2. Parameters are labelled by name, so the input file is self-documenting and robust against reordering.
  3. It is easy to add exactly the features you need, such as writing the parameters back out next to the data (save_file above).

Cons

  1. You have to write, test and maintain it yourself - a bug in the parser becomes a bug in your science.
  2. It only really handles flat key-value data; anything with more structure (lattices, nested options) quickly outgrows it.

JSON / YAML / TOML

Babies of Web 2.0, these standards are widely supported and excel at handling heterogeneous data. The syntax for YAML and TOML is a bit cleaner than JSON, but the three are otherwise essentially interchangeable. For a Rosetta Stone of sorts, see this deep dive.
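As a concrete sketch, Python 3.11+ can read TOML with the standard-library module tomllib, so a file like lattice/honeycomb.toml could be consumed along these lines. The keys used here (name, sites, label, position) are purely illustrative, not a prescribed schema.

import tomllib  # standard library from Python 3.11 onwards

# tomllib requires the file to be opened in binary mode
with open("lattice/honeycomb.toml", "rb") as f:
    lattice = tomllib.load(f)

print(lattice["name"])                      # e.g. "honeycomb" (illustrative key)
for site in lattice["sites"]:               # hypothetical [[sites]] array of tables
    print(site["label"], site["position"])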

Pros

  1. Mature parsers exist for essentially every language, so you never have to write or debug the parsing yourself.
  2. They cope with nested, heterogeneous data - lists, tables, strings and numbers can all live in one file.
  3. The files remain human-readable and easy to edit by hand.

Cons

  1. In C, C++ or Fortran they pull in an external library dependency - exactly what the homebrew approach avoids.
  2. It is tempting to over-engineer the input schema just because the format allows it.
  3. YAML in particular has some well-known parsing surprises (significant whitespace, implicit type conversion).

The Output File

Simulation output is unpredictable, and so it is generally speaking a bad idea to use a custom file format for it. Whichever format you choose, every output should record:

  1. What input parameters produced this output
  2. A timestamp
  3. What version of the code was running
  4. (optional) How the output can be visualised

[2] and [3] are often overlooked, but they’re extremely important if you make changes to your code. (And if you’re doing anything nontrivial, you will be making changes to your code!)

If you go with the simple approach, i.e. managing metadata by copying the input file verbatim to the output directory (with some clear association between it and the output data, such as a common filename root), you can just prepend this extra data as a comment:

# input_copy.cff
# run 2022-04-10 17:33
# gs_find v1.14 'new Hermite solver'
system_size = 12
n_sample = 128
n_timesteps = 512
n_Egrid = 128
dt = 0.025
theta = 0.001

Just increment the version number every time you change a behaviour. The same goes for TOML, JSON or any other file format.
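A minimal sketch of how the executable (or a thin driver around it) might produce such a copy - the VERSION string and paths here are placeholders:

from datetime import datetime

VERSION = "gs_find v1.14 'new Hermite solver'"   # bump this whenever behaviour changes

def copy_input_with_metadata(input_path, output_dir):
    """Copy the input file next to the data, prepending a timestamp and code version."""
    stamp = datetime.now().strftime("%Y-%m-%d %H:%M")
    with open(input_path) as src:
        body = src.read()
    with open(f"{output_dir}/params.inp", "w") as dst:
        dst.write(f"# run {stamp}\n# {VERSION}\n{body}")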

CSV

It is generally desirable to use common, human-readable data formats. .csv is the obvious choice - it’s trivial to implement basic CSV parsers in C++ (and not much harder in Fortran or pure C), and most linear algebra libraries (e.g. Eigen, Armadillo) have built-in functions for saving to CSV.

This comes at a price - an ASCII representation of a double is not generally an exact mapping (though in practice, any error introduced here will be swamped by more mundane sources such as roundoff), and it is woefully space-inefficient: the full precision of a double renders into something like -1.0123456789012345e+101, which is 24 bytes to store 8 bytes of data.

Another major downside is that your data has to be massaged into a 2D tabular format, severely restricting the physical applications this is suitable for.

For Model A, one could structure a file containing the observable $\langle \mathbf{r}_1 + \mathbf{r}_2 \rangle$ like this:

# 2022-04-02 14:44 GMT V=parabola.csv n=2 n_states=11
# Energy | <psi|x1 + x2|psi>
 -11.11229 | 0.298329 ...
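Reading such a file back is then a one-liner in most environments; a sketch with NumPy (the path is illustrative):

import numpy as np

# '#' lines are treated as comments, columns are separated by '|';
# ndmin=2 keeps a 2D shape even if there is only one data row
data = np.loadtxt("data/two_part/observables.csv", comments="#", delimiter="|", ndmin=2)
energy, x_expectation = data[:, 0], data[:, 1]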

Pros:

  1. Human-readable: you can sanity-check results with less, a spreadsheet or any plotting tool.
  2. Trivial to read and write from any language, with no library dependencies.
  3. Comment lines give you a natural home for the metadata of maxim [1] (as in the example above).

Cons:

  1. Roughly three times larger than the equivalent binary, and the text round-trip is not bit-exact.
  2. Everything has to be forced into a 2D table; higher-dimensional or heterogeneous output simply does not fit.
  3. Reading and writing large files is slow compared to binary formats.

Raw binary files

The C standard library function fwrite lets you just dump whatever complicated object is in RAM onto the disk and call it a day.

Pros:

  1. Fast, compact and exact - no formatting, no parsing, no precision loss.
  2. Nothing to implement beyond a matching fwrite/fread pair.

Cons

  1. The file is meaningless without the exact layout (and often the exact compiler and architecture) that wrote it: endianness, struct padding and type sizes are all baked in silently.
  2. There is nowhere natural to put metadata, which makes it a flagrant violation of ‘label everything’.
  3. You cannot inspect it by eye when something looks wrong.
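To make the main con concrete: reading a raw dump back requires knowing the element type and array shape from somewhere else entirely. A NumPy sketch (the file name and shape are hypothetical):

import numpy as np

# nothing in the file itself records that it holds 128x128 little-endian float64 values;
# get the dtype or the shape wrong and you silently read garbage
psi = np.fromfile("data/two_part/wavefunc.bin", dtype="<f8").reshape(128, 128)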

NumPy data format

There are libraries for storing array data as .npy, the same format that NumPy saves its arrays in with np.save('data.npy', a).

Pros

  1. Binary and compact, and the header stores the array’s dtype and shape - no out-of-band knowledge needed to read it back.
  2. Loads directly into analysis and plotting scripts with np.load.

Cons

  1. One array per file (several arrays need .npz or separate files), so richer metadata still has to live elsewhere.
  2. Writing it from C, C++ or Fortran means pulling in (or hand-rolling) a small extra library.
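For completeness, the Python side of the round trip is about as simple as it gets (paths illustrative):

import numpy as np

a = np.linspace(0.0, 1.0, 11)        # some array produced by the simulation or analysis
np.save("data/example.npy", a)       # header stores dtype and shape alongside the raw bytes
b = np.load("data/example.npy")      # comes back with the correct dtype and shape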

HDF5

If your output is complicated, you may want to consider using HDF5, which stands for Hierarchical Data Format. This is the native format used by the Julia library JLD, which plays the role of NumPy’s far more limited .npy format.

A real use case of this is the following snippet, from my spinon GMFT code

using JLD
using HDF5

# ... other code ...

# physical constants associated with the simulation. 
struct SimulationParameters
    A::Matrix{Float64}
    Jpm::Float64
    B::Vec3_F64
    λ::Float64
    lat::geom.PyroFCC
    name::String
end

# Represents a path through the Brillouin zone.
struct BZPath
    t::Vector{Float64} # the "time" parameter, units of reciprocal length
    K::Vector{Vector{Float64}} # a list of points in K-space.
    ticks_t::Vector{Float64} # The "times" corresponding to high-symmetry points
    ticks_label::Vector{String} # The labels for the high symmetry points
end

# A 'hash function' providing an informative filename for simulation data
function sim_identifier(sim::SimulationParameters)
    return @sprintf("?name=%s?J_pm=%.3f?B=[%.3f,%.3f,%.3f]",
        sim.name,sim.Jpm,sim.B[1],sim.B[2],sim.B[3]) 
end

function save_spinons(output_dir::String;
        bands::Matrix{Float64},
        BZ_path::BZPath,
        sim::SimulationParameters, 
        prefix="spinons")

    name = output_dir*"/"*prefix*sim_identifier(sim)*".jld"
    jldopen(name, "w") do file
        
        # stores a copy of the input
        g = create_group(file, "physical_parameters")
        g["name"] = sim.name
        g["fluxes"] = sim.A
        g["Jpm"] = sim.Jpm
        g["B"] = sim.B
        g["lambda"] = sim.λ
        g["L"]=sim.lat.L

        # stores the output 
        d = create_group(file, "spinon_dispersion") 
        d["bands"] = bands
        # a list of K points, such that the i'th slice corresponds to the i'th K point
        d["Q_list"] = BZ_path.K 
        d["tau"] = BZ_path.t
        d["ticks_tau"] = BZ_path.ticks_t
        d["ticks_label"] = BZ_path.ticks_label
    end
    return name
end

A “group” in an HDF5 file is essentially a folder, allowing for neat organisation of complex data.
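Because a .jld file is ordinary HDF5 underneath, the groups and datasets written above can also be opened from Python with h5py - a sketch, with an illustrative file name (JLD’s own bookkeeping entries are simply ignored, and Julia’s column-major arrays may come back transposed):

import h5py

with h5py.File("spinons.jld", "r") as f:        # illustrative file name
    Jpm   = f["physical_parameters/Jpm"][()]    # the input coupling, saved with the data
    bands = f["spinon_dispersion/bands"][()]    # the output band energies (possibly transposed)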

Pros

  1. Hierarchical groups let inputs, outputs and metadata live together in a single self-describing file (maxims [1] and [5]).
  2. Binary and compact, and it handles arbitrarily shaped arrays alongside scalars and strings.
  3. Mature bindings exist for C, C++, Fortran, Python (h5py) and Julia, so one file can serve the whole toolchain.
  4. Supports parallel access for large simulations.

The Bottom Line

As of 2024, my preferred general-purpose solution is to use something human-readable (e.g. TOML) for input data, and HDF5 for output. Parallel access in HDF5 is a powerful extra feature for large simulations. That being said, there is always a trade-off between complexity and readability - if you are not an experienced programmer, some of these library interfaces can appear to be more trouble than they’re worth.