One of the most difficult challenges faced by a computational physicist is designing software in a user-friendly way. Most masters (or PhD, for that matter) students will present any code they have written accompanied by a profuse set of apologies for the mess the receiver is inheriting. This is a problem for several scientific reasons:
- Good science should be independently verifiable, so there is a moral obligation to make code accessible for others to read, understand and run to check your work.
- Good code is easier to modify - investing a few weeks into setting the problem up robustly can translate to massive time savings later, particularly if someone else starts maintaining your code.
- Good practices make you more confident that the result you have is correct: Making code modular allows you to test different subsections independently, rooting out bugs before they cause weird behaviour somewhere else.
Living up to these lofty ideals is not so easy.
- By definition, you do not know what the final form of research code will look like when you start writing it.
- Scientists usually work individually, meaning that there are no ongoing checks protecting against bad practices or rewarding good ones.
- Writing good code is harder than writing bad code.
- Every research problem is unique, making it hard to modularise your code.
- It requires substantial experience to know which paradigms from the wider world of programming are helpful.
This post is the beginning of a series in which I want to present my own perspective on how to tackle the non-physics parts of computational physics - which language to choose, how to structure folders and input files, and how to organise your project so that it is easily comprehensible to yourself and others. A tentative plan is the following:
- I Directory structure and input files
- II Choice of language
- III Amalgamating output files in a sensible way
- IV Plotting code
The Toy Models
For concreteness, I will consider two different physics problems to illustrate the different approaches that can be used -
Model A: A deterministic Schrödinger equation solver. ‘Exactly’ diagonalises the Hamiltonian for n interacting particles in an arbitrary 1D potential well V(x).
Inputs:
- V(x): The potential well.
- n: The number of particles.
- n_states: The number of eigenvalues we want to find (0 = ground state only)
Outputs:
- Energy/ies
- Wavefunction/s
Model B: Spin-Wave Theory. Find the classical ground state of a lattice, expand the ground state in Holstein-Primakoff bosons, truncate to next-leading order and diagonalise the Hamiltonian. Then predict the spin correlators seen by neutron scattering.
Inputs:
- Lattice: describes the sites and connectivity of the magnetic sites.
- corrtype: specifies which correlators to probe (the relevant one depends on the kind of spins being investigated).
- BZpath: the path through the Brillouin zone to plot.
- B: the external applied magnetic field.
Outputs:
- Magnon dispersion along a specific curve in $k$ space
- Spectral weight as seen in a single crystal scattering experiment
- Spectral weight as seen by a powder diffraction experiment
The Zen of Physics Code
Before I met Jeanine, my life was cosmically a shambles, it was ah… I was using bits and pieces of whatever Eastern philosophies happened to drift through my transom and she sort of sorted it out for me, straightened it out for me, gave me a path, you know, a path to follow.
- David St. Hubbins, This is Spinal Tap
- [1] Make it obvious where data comes from.
- [2] Make it obvious what programs do.
- [3] Programs should do one thing and do it well.
- [4] At most three shell commands should be involved in transforming parameters into a publishable graph.
- [5] Label everything.
The general idea here is to write executables and datafiles that can be read as verbs and nouns that describe the whole calculation:
[ Find the two-body Hamiltonian ] of ( n electrons ) and [ diagonalise it ]
[ Find the classical ground state ] of ( spins on a lattice ),
[ expand to second order in Holstein-Primakoff bosons ] and
[ find the corresponding spin correlators ],
then [ average over them with Monte Carlo ]
Each item in [ brackets ] can be interpreted as a verb, and arguably deserves its own executable, and the item in ( parentheses ) should have an input file associated with it. Obviously there’s a judgement call to make here when you decide how to modularise everything - if the code is fast, it can be more appropriate to have a ‘god verb’ that does everything. These guidelines are intended for a fairly specific use case: a complicated, multi-step calculation that technically only needs to run once.
Less abstractly, these maxims translate to
- Every datafile in your output should have the input that made it and the context to interpret it saved with it. If you are producing a 2D ‘heatmap’ output, you should save the 1D X axis and Y axis information in the same place that you saved the Z axis matrix (see the sketch after this list).
- Write your ‘verbs’ in such a way that they can be used without reading the source code. Instead of crashing when you use an incorrect call sequence, gently remind the user how it’s meant to be called with a message. Use at most 4 command line parameters: any more, and they should be in a file.
- Your code does not have to fit into one executable. If your simulation involves precalculating some parameters and then running a Monte Carlo simulation, you should have two executables, calc_params and run_sim, where the latter is fed output from the former. The intermediate data should be saved to a file (with metadata), as per [1].
- Your project should have at most three levers - “run simulation”, “process data” and “plot data”. (A good way to achieve this is with shell or python scripts.)
- Magic numbers, whether in the source code or in the in/out files, are not allowed.
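To make the first item of this list concrete, here is a minimal Python sketch of saving a heatmap together with its axes and the parameters that produced it. The filenames, keys and parameter values are invented for illustration; they are not taken from either toy model.
# save_heatmap.py - minimal sketch of keeping axes and parameters with the data.
# All names and values here are invented for illustration.
import numpy as np

x = np.linspace(0.0, 1.0, 101)          # 1D X axis
y = np.linspace(-5.0, 5.0, 201)         # 1D Y axis
Z = np.exp(-np.add.outer(y**2, x))      # the 2D heatmap, shape (len(y), len(x))

# One file holds the Z matrix *and* its axes *and* the inputs that produced it
np.savez("heatmap_run01.npz",
         x=x, y=y, Z=Z,
         n_particles=2, interaction_strength=1e-4)

# A plotting script later needs nothing except this one file
with np.load("heatmap_run01.npz") as f:
    print(f["Z"].shape, f["n_particles"])
Whether the container is .npz, CSV with a comment header, or HDF5 (all discussed below) matters less than the fact that everything needed to interpret Z travels with it.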
Directory structure
This is more art than science, but if you’re writing your first big project, even working out where to start can seem an insurmountable obstacle.
The thing to have in mind is this: For science code, the main tasks you’ll be doing are
- Writing the simulation
- Creating new simulations with slightly different parameters
- Plotting the output data in new, interesting ways
The goal now is to minimise the hassle that these tasks cause.
Example: Model A
mdA + .gitignore
+ Makefile
+ README.md
+ _driver ----- + simulate.sh
| + process.sh
+ _process ---- + plot_wavefunc.py
| + plot_momentum.py
| + correlate.ipynb
| + img --------- + wavefun_onepart.pdf
| + wavefun_twopart.pdf
+ bin --------- + simwavfunc
| + simspinfunc
| + spinwavfunc_field
+ src --------- + sim.h
| + sim.c
| + ...
+ input ----- + one_part.inp
| + two_part.inp
| + potentials -- + harmonic.csv
| + quartic.csv
| + free.csv
+ data ------ + one_part ---- + _sim_params.inp
| + _potential.csv
| + rcorr.csv
| + pcorr.csv
| + wavefunc.csv
| + energy.csv
+ two_part ---- + _sim_params.inp
+ _potential.csv
+ rcorr.csv
+ pcorr.csv
+ wavefunc.csv
+ energy.csv
I’m using underscores to push the most important things to the top of ls output.
Creating and running a simulation might then look like
cd mdA
cp input/two_part.inp input/three_part.inp
vi input/three_part.inp # edit the input parameters...
bin/simwavfunc input/three_part.inp data/three_part
# wait until finished...
python3 _process/plot_wavefunc.py data/three_part
The bash scripts simulate.sh and process.sh are the last things to be added before publishing. They contain, respectively:
#!/bin/bash
# simulate.sh
mkdir data
for sim in `ls input`; do
# remove the extension to get the folder name
s="${sim%.*}"
mkdir "data/$s"
bin/simwavfunc "input/$sim" "data/$s"
done
#!/bin/bash
# process.sh
mkdir _process/img
for dir in `ls data`; do
python3 _process/plot_wavefunc.py "data/$dir"
done
These are the two levers of maxim [4].
Note _sim_params.inp in the output folders. These are copies of one_part.inp and two_part.inp, which are produced by the simulation executable. This is helpful for many reasons:
- In case you change any of the input data while your code is running (very easy to do if it spans multiple days)
- To keep metadata with data (maxim [1])
- To reassure you that the program is reading the parameters correctly.
- To keep all the data that a post-processing script needs in one place.
Some extra benefits of this structure are:
- The programs involved never have to traverse system directories. This is helpful for code readability and cross-platform compatibility.
- The data folder could be called anything and located anywhere (even on a networked drive!)
- We don’t need any fancy programs or file parsers to read everything that happens.
If necessary, judicious use of tar can compress the data files if you need to transfer them between different computers. (Cheat sheet: tar -czvf MyDir.tar.gz MyDir to compress, tar -xzvf MyDir.tar.gz to extract. It’s unlikely that you would ever need to do anything else with tar.)
Example: Spin-Wave Theory
The requirements here are a bit different. One of our inputs is a 3D lattice, which has complex structure that can’t be represented in a simple CSV / .inp file. In this case, it’s more appropriate to use a “real” format (e.g. TOML) to store the complicated input data (see below for details).
mdB + .gitignore
+ Makefile
+ README.md
+ _process ---- + _fit_cote_0p05.py
| + _fit_cote_1p70.py
| + fitting.ipynb
| + correlate.ipynb
| + spinW.py
+ fig --------- + powder_honeycomb_realistic.pdf
| + dispersion_honeycomb_realistic.pdf
| + ...
+ bin --------- + spinwave
| + paverage
+ src --------- + sim.h
| + sim.c
| + ...
+ experimental + CoTe_sqw_0p05K.csv
| + CoTe_sqw_1p70K.csv
| + ...
+ lattice ----- + honeycomb.toml
+ cubic.toml
+ pyrochlore.toml
This one’s a bit different. The use case here is fitting experimental data using the spinwave toolchain - the .ipynb notebooks are used for experimentation, but the final data is produced by the python scripts _fit_cote_0p05.py and _fit_cote_1p70.py. The compiled binaries are called by the library spinW.py.
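Incidentally, here is a purely hypothetical sketch of what a file like lattice/honeycomb.toml could contain. The key names are invented for illustration - they are not the ones my code actually uses - but they show how naturally TOML handles this kind of structured input:
# lattice/honeycomb.toml - hypothetical sketch; key names are invented
name = "honeycomb"

# Bravais lattice vectors, one row per vector
lattice_vectors = [[1.0, 0.0], [-0.5, 0.8660254]]

# magnetic sites in the unit cell (fractional coordinates)
[[site]]
label = "A"
position = [0.0, 0.0]

[[site]]
label = "B"
position = [0.3333333, 0.6666667]

# nearest-neighbour bonds: which sites connect, and in which neighbouring cell
# (the two bonds reaching into neighbouring cells are omitted for brevity)
[[bond]]
from = "A"
to = "B"
cell_offset = [0, 0]
J = 1.0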
The input file
It’s really tempting to write input files in a quick and dirty way. For example:
# input.in
# first run, single particle test in a harmonic trap
1 # number of particles
11 # number of sites
25 16 9 4 1 0 1 4 9 16 25 # potential (must have n = number of sites)
1e-4 # interaction strength
or even more mystically
1
11
25 16 9 4 1 0 1 4 9 16 25
1e-4
The last two structures are “simple” to parse in a C-like language -
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h> // for uint16_t
#define MAX_LINE_LEN 1024
void parse_file(
const char* fname,
uint16_t* nparts,
uint16_t* nsites,
double** V,
double* istrength
) {
FILE* fp = fopen(fname, "r");
char linebuf[MAX_LINE_LEN];
// read first line
fgets(linebuf, MAX_LINE_LEN, fp);
*nparts = atoi(linebuf);
// read second line
fgets(linebuf, MAX_LINE_LEN, fp);
*nsites = atoi(linebuf);
*V = (double *) malloc(sizeof(double)* (*nsites));
// read third line V into memory
fgets(linebuf, MAX_LINE_LEN, fp);
// assume space delimiters
char *token = strtok(linebuf, " ");
uint16_t idx=0;
while( token != NULL ) {
(*V)[idx] = atof(token);
token = strtok(NULL, " ");
idx++;
if (idx >= *nsites) break; // make sure we don't overrun!
}
// read fourth line
fgets(linebuf, MAX_LINE_LEN, fp);
*istrength = atof(linebuf);
}
From a scientific perspective, the code is very dangerous. Suppose you miscount and write
1
15
25 16 9 4 1 0 1 4 9 16 25
1e-4
Then the program will silently allocate more space than is needed for V, and the extra entries will be set to zero (if the OS is being polite) or, even worse, to whatever crap was lying in the RAM before your program got hold of it!
If we forget the line order in the following way,
1
1e-4
25 16 9 4 1 0 1 4 9 16 25
11
It’s not clear what atoi will do now, but it definitely isn’t the right thing, and there will again be no warnings about this.
#define MAX_LINE_LEN 1024 might also become a problem if we have a somewhat large potential - that’s only enough to fit ~40 plaintext doubles at full precision.
None of these issues is unfixable; it’s just a massive time-sink trying to sort out these annoying data validation problems. I propose a few alternatives below.
The worst option for handling input is using unlabeled command line arguments.
./sim 1 11 25 16 9 4 1 0 1 4 9 16 25 1e-4
This is completely opaque, and is a flagrant violation of [1] and [2]. In my opinion, positional command line arguments should never be used to specify physical parameters - it’s just too easy to forget which place means what.
Rolling your own parser code
This can be defensible if you only have a small number of low-complexity inputs. There’s something to be said for avoiding dependencies on libraries. Some guidelines for doing this though:
- Write the input file before you write the parser.
- Use natural-feeling delimiters like = and :
- Do it once, do it right, never worry about it again.
My own homebrew parser is on GitHub if you’re interested. Its interface works like this:
#include <parameter.hh>
// Use template magic to declare what types our numeric parameters are
typedef parameters<int, unsigned, double> param_t;
int main (int argc, char** argv) {
// ---------- Reading parameters ----------
// Required parameters
if (argc < 4){
fprintf(stderr, "Required parameters missing\n");
return EXIT_FAILURE;
}
int n;
double T;
// Optional parameter variables
unsigned
sample = 2048,
sweep = 4,
burnin = 128,
n_time = 512,
n_freq = 128;
double
delta = 0.0625;
param_t p;
p.declare("system_size",n);
p.declare("temperature",T);
p.declare("n_sample", sample);
p.declare("sweep", sweep);
p.declare("n_burnin", burnin);
p.declare("n_timesteps", n_time);
p.declare("n_Egrid", n_freq);
p.declare("dt", delta);
double gphase=0;
p.declare("theta", gphase);
//read the params
p.from_file(argv[1]);
// ... simulation code uses sample, sweep, burnin....
std::string param_copy_path(argv[2]);
param_copy_path.append("/params.inp");
p.save_file(param_copy_path);
return 0;
}
Pros
- Very flexible
- Can be quicker to get running if you know what you’re doing
- Can implement data validation that errors if invalid data is fed in
Cons
- Lots of ways to mess it up
- Tradeoff between readability of the parser code and readability/robustness of the input file
JSON / YAML / TOML
Babies of Web 2.0, these standards are widely supported and excel at handling heterogeneous data. The syntax of YAML and TOML is a bit cleaner than JSON’s, but the three are otherwise essentially interchangeable. For a Rosetta Stone of sorts, see this deep dive.
Pros
- Clean syntax
- Handles heterogeneous data very naturally
- Standard compliance means any decent text editor will highlight the syntax
Cons
- Generally requires using a library, which can get annoying in compiled languages. That being said, several libraries exist to do this which are well documented and can be installed as header-only packages.
- TOML++ parses TOML (and can serialise to JSON and YAML), and has nice documentation
- nlohmann/json An older but more active project, with harder to navigate docs
- qc-json A newer JSON library, with more wide-ranging support. It’s header-only, so can be copy/pasted right into your project without any problematic CMake business.
- No data validation by default; this also has to be done by hand
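For reference, reading such a file from Python needs no third-party library at all. A minimal sketch, assuming Python 3.11+ (tomllib is in the standard library from 3.11) and a hypothetical input.toml that mirrors the quick-and-dirty input.in above:
# read_input.py - minimal sketch, assuming Python 3.11+.
# The file layout is hypothetical; input.toml might read:
#
#   n_particles = 1
#   interaction_strength = 1e-4
#   potential = [25, 16, 9, 4, 1, 0, 1, 4, 9, 16, 25]
import tomllib

with open("input.toml", "rb") as f:        # tomllib requires binary mode
    params = tomllib.load(f)

n = params["n_particles"]                  # a missing or misspelled key fails
V = params["potential"]                    # loudly with a KeyError...
g = params["interaction_strength"]
n_sites = len(V)                           # ...and the site count can never
                                           # disagree with the potential length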
The Output File
Simulation output is unpredictable, so it is generally a bad idea to use a custom file format here - much of your time will be wasted maintaining your parser. You may choose to make this a single federated output that contains a complex set of data, or a collection of files grouped by a common root, e.g. smallq_input_copy.inp, smallq_wavefunc, smallq_filefunc. In addition to the physical parameters needed, it’s best practice to save additional metadata with output:
- [1] What input parameters produced this output
- [2] A timestamp
- [3] What version of the code was running
- [4] (optional) How the output can be visualised
[2] and [3] are often overlooked, but they’re extremely important if you make changes to your code. (And if you’re doing anything nontrivial, you will be making changes to your code!)
If you go with the simple approach, i.e. managing metadata by copying the input file verbatim to the output directory (with some clear association between it and the output data, such as a common filename root), you can just prepend this extra data as a comment:
# input_copy.cff
# run 2022-04-10 17:33
# gs_find v1.14 'new Hermite solver'
system_size = 12
n_sample = 128
n_timesteps = 512
n_Egrid = 128
dt = 0.025
theta = 0.001
Just increment the version number every time you change a behaviour. The same goes for TOML, JSON or any other file format.
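A minimal sketch of producing such a stamped copy (here in Python; the version string, filenames and paths are placeholders rather than anything from the real projects above):
# stamp_input_copy.py - minimal sketch of copying the input file into the
# output directory with a timestamp and code-version comment prepended.
# CODE_VERSION, the filenames and the paths are placeholders.
import datetime
import pathlib
import sys

CODE_VERSION = "gs_find v1.14 'new Hermite solver'"

def save_input_copy(input_path: str, output_dir: str) -> None:
    stamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M")
    header = f"# input_copy\n# run {stamp}\n# {CODE_VERSION}\n"
    text = pathlib.Path(input_path).read_text()
    out = pathlib.Path(output_dir) / "input_copy.inp"
    out.write_text(header + text)

if __name__ == "__main__":
    save_input_copy(sys.argv[1], sys.argv[2])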
CSV
It is generally desirable to use common, human-readable data formats. .csv is the obvious choice - it’s trivial to implement basic CSV parsers in C++ (and not much harder in FORTRAN or pure C), and most linear algebra libraries (e.g. Eigen, Armadillo) have built-in functions for saving to CSV.
This comes at a price - an ASCII representation of a double is not generally an exact mapping (though in practice, any error introduced here will be swamped by the error from more mundane things like roundoff), and it is woefully space-inefficient: all of the precision available from a double renders into something like -1.0123456789012345e+101, which is 24 bytes to store 8 bytes of data.
Another major downside is that your data has to be massaged into a 2D tabular format, severely restricting the physical applications this is suitable for.
For Model A, one could structure a file containing the observable $\langle \mathbf{r}_1 + \mathbf{r}_2 \rangle$ as follows:
# 2022-04-02 14:44 GMT V=parabola.csv n=2 n_states=11
# Energy | <psi|x1 + x2|psi>
-11.11229 | 0.298329 ...
Pros:
- Portable
- Simple to implement
- Easily readable by anyone
- Metadata can live in the file
Cons:
- Slow IO
- “Lossy” format
- Limited to 2D data
- Documentation must be done with a “hack” in a comment
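As a sketch of how little code the CSV route needs (NumPy’s savetxt writes the comment header for us; the layout follows the Model A example above but with comma delimiters, and the numbers beyond the first row are made up):
# write_observables.py - minimal sketch of a CSV whose metadata lives in a
# comment header. Numbers beyond the first row are made up for illustration.
import numpy as np

energies = np.array([-11.11229, -9.73210, -8.41077])
expect_x = np.array([0.298329, 0.151200, 0.080004])   # <psi| x1 + x2 |psi>

header = ("2022-04-02 14:44 GMT V=parabola.csv n=2 n_states=11\n"
          "Energy, <psi|x1 + x2|psi>")
np.savetxt("observables.csv",
           np.column_stack([energies, expect_x]),
           delimiter=",", header=header)

# np.loadtxt skips the '#' comment lines, so reading it back is one call
data = np.loadtxt("observables.csv", delimiter=",")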
Raw binary files
The C standard library function fwrite lets you just dump whatever complicated object is in RAM onto the disk and call it a day.
Pros:
- Really, really easy to implement
- “Lossless” format
Cons
- Need to know the in-memory structure of the object in order to read it back; it is very easy to mix up column-major and row-major ordering for arrays
- od can only get you so far in interpreting a raw binary dump
- Header information must be in a separate file
- Platform dependent (some architectures represent integers “backwards”). This can be fatal if you intend to simulate on one machine (e.g. a cluster) and process on another
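To make the first of these cons concrete: reading a raw dump back in (here with NumPy, assuming the writer fwrote a 128×128 array of doubles) forces the reader to hard-code the dtype, the shape and the ordering, because none of them live in the file:
# read_raw.py - sketch of why raw dumps are fragile: nothing in the file
# records the dtype, shape or ordering. The 128x128 double-precision layout
# assumed here is hypothetical.
import numpy as np

flat = np.fromfile("wavefunc.bin", dtype=np.float64)   # hope the writer used doubles...
psi = flat.reshape(128, 128)                           # ...and this shape, row-major
# If the writer was a column-major (Fortran-ordered) code, the line above
# silently transposes the data; you would need:
# psi = flat.reshape(128, 128, order="F")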
NumPy data format
There is a library for storing array data as .npy, the same format that NumPy saves its arrays in with np.save('data.npy', a).
Pros
- Efficient
- Portable
- Readable by vanilla NumPy
Cons
- Requires a library in C/C++/FORTRAN
- No obvious way to add header information
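For comparison with the raw dump above, the round trip on the Python side is pure NumPy (the compiled side would need a small third-party .npy reader/writer):
# npy_roundtrip.py - minimal sketch: the .npy header records dtype, shape and
# ordering, so the reader needs no out-of-band knowledge.
import numpy as np

psi = np.random.default_rng(0).normal(size=(128, 128))
np.save("wavefunc.npy", psi)          # writes dtype + shape + data

psi_back = np.load("wavefunc.npy")    # exact round trip
assert psi_back.dtype == psi.dtype and psi_back.shape == psi.shape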
HDF5
If your output is complicated, you may want to consider using HDF5, which stands for Hierarchical Data Format. This is the native data format used by the Julia library JLD, the equivalent of NumPy’s far more limited .npy format.
A real use case of this is the following snippet, from my spinon GMFT code
using JLD
using HDF5
using Printf # for @sprintf below
# ... other code ...
# physical constants associated with the simulation.
struct SimulationParameters
A::Matrix{Float64}
Jpm::Float64
B::Vec3_F64
λ::Float64
lat::geom.PyroFCC
name::String
end
# Represents a path through the Brillouin zone.
struct BZPath
t::Vector{Float64} # the "time" parameter, units of reciprocal length
K::Vector{Vector{Float64}} # a list of points in K-space.
ticks_t::Vector{Float64} # The "times" corresponding to high-symmetry points
ticks_label::Vector{String} # The labels for the high symmetry points
end
# A 'hash function' providing an informative filename for simulation data
function sim_identifier(sim::SimulationParameters)
return @sprintf("?name=%s?J_pm=%.3f?B=[%.3f,%.3f,%.3f]",
sim.name,sim.Jpm,sim.B[1],sim.B[2],sim.B[3])
end
function save_spinons(output_dir::String;
bands::Matrix{Float64},
BZ_path::BZPath,
sim::SimulationParameters,
prefix="spinons")
name = output_dir*"/"*prefix*sim_identifier(sim)*".jld"
jldopen(name, "w") do file
# stores a copy of the input
g = create_group(file, "physical_parameters")
g["name"] = sim.name
g["fluxes"] = sim.A
g["Jpm"] = sim.Jpm
g["B"] = sim.B
g["lambda"] = sim.λ
g["L"]=sim.lat.L
# stores the output
d = create_group(file, "spinon_dispersion")
d["bands"] = bands
# a list of K points, such that the i'th slice corresponds to the i'th K point
d["Q_list"] = BZ_path.K
d["tau"] = BZ_path.t
d["ticks_tau"] = BZ_path.ticks_t
d["ticks_label"] = BZ_path.ticks_label
end
return name
end
A “group” in an HDF5 file is essentially a folder, allowing for neat organisation of complex data.
Pros
- Very flexible
- The industry standard scientific data format
- Platform independent
- Efficient data IO
- Compatible with parallel access
- Capable of storing any kind of heterogeneous data in one spot
Cons
- The generality of the format can make the library functions complicated to use
- Requires specialised software to read it
- Slightly annoying to use with NumPy (see the h5py module).
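For what it’s worth, reading the file written by save_spinons above from Python is short, if slightly clunky. A sketch, assuming h5py is installed; the group and dataset names are taken from the Julia snippet, while the filename is a placeholder:
# read_spinons.py - sketch of pulling the plain numeric datasets back out of
# the HDF5/JLD file written by save_spinons() above, using h5py. Group and
# dataset names come from that snippet; the filename is a placeholder.
import h5py

with h5py.File("spinons_example.jld", "r") as f:
    bands = f["spinon_dispersion/bands"][...]   # whole dataset -> NumPy array
    tau = f["spinon_dispersion/tau"][...]
    Jpm = f["physical_parameters/Jpm"][()]      # scalar dataset
    B = f["physical_parameters/B"][...]
# Note: Julia stores arrays column-major, so 2D datasets like 'bands' may come
# back transposed relative to how they looked on the Julia side.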
The Bottom Line
As of 2024, my preferred general-purpose solution is to use something human-readable (e.g. TOML) for input data, and HDF5 for output. Parallel access in HDF5 is a particularly powerful feature. That being said, there is always a trade-off between complexity and readability - if you are not an experienced programmer, some of these library interfaces can appear to be more trouble than they’re worth.