If you’ve read the yet-to-be-published Computational science cookbook II post, you’ll know that it’s very likely that multiple files will be involved in setting up, running and post-processing your simulations. The trouble is traceability: how do you correlate each output with the input that produced it?
I wanted to go into depth about the kinds of rudimentary database management you can use for running physics simulations in a way that’s quick to set up and reliable.
lazyDB: database management through file names
There are famously two opposing paradigms in CS: functional and object-oriented. The former works only with pure functions; the latter encourages a complex stateful interplay of data-containing objects. Both have their respective strengths.
In some sense, the “functional” approach to naming output files is that every output’s filename contains all of the information needed to recreate it.
E.g. suppose you have a numerical Schrödinger equation solver; your output files will look like

```
run1?version=1.5?nsteps=200?potential=[9,4,2,1,2,4,9]?n=5?method=exactdiag_wf.csv
```
whereas the stateful-paradigm approach would have output files named like

```
run2?version=1.5?infile=input/parabolic_9_wf.csv
```
Keeping everything in the filename looks hacky, but at the end of the day it’s how the Web works: GET requests look like

```
https://duckduckgo.com/?q=databases+for+scientific+computing&t=newext&atb=v298-1&ia=web
```

and they power almost everything you see.
| Pros | Cons |
| --- | --- |
| Completely obvious what is happening | Long filenames are limiting (255-byte limit per filename on most UNIX filesystems) |
| Remembers the state of the input file when it is read | Difficult to see what is happening at a glance |
| No risk of overwriting data with different parameters | |
| Easy to label graphs when parsing output | |
| Robust to any modifications to input files | |
The stateful-paradigm, input-file approach has the opposite situation:

| Pros | Cons |
| --- | --- |
| Arbitrarily large, complex input structures can be used | Input files must be made read-only to avoid later confusion |
| If well-named, output file names contain only relevant information | |
I want to stress that storing parameters in the filenames of output you generate is a good practice, even if it looks ugly or feels hacky. It only becomes an issue if you plan on radically changing the meaning of those parameters, which is easily avoided by giving the changed parameter a new name, or better yet storing a “version” field in the filename.
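As a concrete sketch (my own helper, not part of the post’s code), encoding a parameter dictionary into a “functional” filename might look like this in Python:

```python
# Encode a parameter dictionary into a "functional" output filename
def encode_filename(stem, params, suffix="_wf.csv"):
    # Sort the keys so identical parameters always produce identical names
    tags = "?".join(f"{k}={v}" for k, v in sorted(params.items()))
    return f"{stem}?{tags}{suffix}"

print(encode_filename("run1", {"version": 1.5, "nsteps": 200, "n": 5,
                               "method": "exactdiag"}))
# run1?method=exactdiag?n=5?nsteps=200?version=1.5_wf.csv
```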
Generally speaking, I find myself dealing with around 10 numerical parameters (that shouldn’t affect the answer much) and 4-5 physical parameters. When doing many trials (as in a Monte Carlo simulation) or many similar computations (as in a parameter sweep), almost none of these change: it seems wasteful to clutter the output folder with a bunch of parameters that are 90% identical.
I therefore suggest a compromise: just do both!
Specifically, my `basic_parser` class implements two interface functions: `from_file` and `from_argv`. Simulation files then start with a verbose but transparent preamble:
```cpp
// Optional parameter variables
unsigned nreps;
double phi, T;

// The header row for any CSV outputs
param_t p("My program v" + VERSION + currentDateTime());

// Bind all of the globals
p.declare("nreps", &nreps);
p.declare("phi", &phi);
p.declare("temperature", &T);

// Read the params from file
p.from_file(argv[1]);
std::filesystem::path output_path(argv[2]);

// Extract the input file's root and use it as the base for the output file
std::filesystem::path in(argv[1]);
// Remove the file extension
std::string prefix(in.filename().replace_extension());

// Append all of the command line arguments from argv[3] onwards as filename annotations
prefix += p.from_argv(argc, argv, 3);
// Ensure that we did not miss anything
p.assert_initialised();

// ... calculations ...

std::ofstream ofs(output_path / (prefix + "_output1.csv"));
```
This simulation has three input parameters, `nreps`, `phi` and `temperature`, where we want to sweep phi at fixed temperature.
Have the simulation program read the file `high_temp_template.toml`, consisting of:

```toml
# measured in units of J
temperature = 100
# number of iteration steps
nreps = 10000
# the coupling constant
# phi = 0
```
Note that the `phi` specification is commented out, so it must be passed from the command line:

```bash
bin/sim input/high_temp_template.toml output_folder/ --phi=0.56
```
This produces output files (relative to the directory that the above line was run in) looking like

```
output_folder/high_temp_template?v=1.5?phi=0.56.quantity1.csv
output_folder/high_temp_template?v=1.5?phi=0.56.quantity2.csv
```
The `v` flag is hard-coded into the executable, meaning ‘version’, and in general you might want to make several output files for the different quantities of interest. It might also be advisable to add a timestamp to the header row of your CSV files as a sanity check.
Running a sweep is simple using a bash for loop:

```bash
for phi in $(seq 0 0.01 1); do
    bin/sim input/high_temp_template.toml output_folder/ --phi=$phi
done
```
This kind of problem is known as an “embarrassingly parallel” calculation: there are a huge number of jobs that can all run completely independently of one another, avoiding the need for complicated OpenMP / MPI structures. Queuing all of these jobs in parallel gets slightly trickier if you want to avoid spawning thousands of processes at once, but that’s a topic for another post.
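In the meantime, here is a minimal sketch (mine, not from that future post) of one way to throttle the sweep from Python, reusing the `bin/sim` invocation above:

```python
# A throttled parameter sweep: ThreadPoolExecutor caps how many
# bin/sim processes run simultaneously.
import subprocess
from concurrent.futures import ThreadPoolExecutor

def run(phi):
    # Each simulation is an independent process, so no locking is needed
    subprocess.run(
        ["bin/sim", "input/high_temp_template.toml", "output_folder/",
         f"--phi={phi:.2f}"],
        check=True,
    )

with ThreadPoolExecutor(max_workers=8) as pool:  # at most 8 jobs at a time
    list(pool.map(run, (i / 100 for i in range(101))))
```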
RegEx
Regular Expressions, or regex, are an absolutely indispensable tool for parsing human-readable files. They are a kind of shorthand for grabbing parts of strings, which is helpful when important simulation parameters are embedded in filenames. For example, when generating a phase diagram you might have a folder full of output files similar to
```
run_2022-05-05%phi=0.2000;%T=1.00e-5;_order.csv
run_2022-05-05%phi=0.2000;%T=1.00e-3;_order.csv
run_2022-05-05%phi=0.2000;%T=1.00e-1;_order.csv
run_2022-05-05%phi=0.4000;%T=1.00e-5;_order.csv
run_2022-05-05%phi=0.4000;%T=1.00e-3;_order.csv
run_2022-05-05%phi=0.4000;%T=1.00e-1;_order.csv
...
```
This naming convention is inspired by the web, which uses `&parameter=value` pairs to pass arguments to the server without navigating to a new page. Now say you want to accumulate all of these in a Python script: it’s a pain to do using string slicing and dicing, but regex has a solution:
```python
import re   # the regex module
import sys
from glob import glob

import numpy as np

# Feed the stem in from the command line (shell autocomplete avoids
# needing to remember it properly)
files = glob(sys.argv[1] + '*_order.csv')
print("Averaging over\n")
print(files)

for file in files:
    # Raw string, so the backslash reaches the regex engine untouched
    res = re.search(r"%phi=([0-9.+\-e]+);%T=([0-9.+\-e]+);", file)
    phi = float(res.group(1))
    T = float(res.group(2))
    print(f"phi = {phi}, T = {T}")
    data = np.genfromtxt(file)
    # ... averaging and aggregating
```
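To finish the job, one option (a sketch of mine continuing the script above, not the original code) is to accumulate the per-file means and pivot them into a 2D grid for the phase diagram:

```python
# Collect {(phi, T): mean order parameter} and pivot into a 2D array
# (assumes every (phi, T) combination is present in the sweep)
results = {}
for file in files:
    res = re.search(r"%phi=([0-9.+\-e]+);%T=([0-9.+\-e]+);", file)
    phi, T = float(res.group(1)), float(res.group(2))
    results[(phi, T)] = np.genfromtxt(file).mean()

phis = sorted({k[0] for k in results})
Ts = sorted({k[1] for k in results})
grid = np.array([[results[(p, t)] for t in Ts] for p in phis])
# grid[i, j] holds the mean at phi = phis[i], T = Ts[j]
```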
Admittedly, regex looks like random noise the first time you look at it. Let’s break it down:
```
%phi=([0-9.+\-e]+);%T=([0-9.+\-e]+);
```

```
%phi=        # Matches the literal characters "%phi="
(            # Start of capture group 1
 [0-9.+\-e]  # Matches any of the characters "0123456789.+-e"
 +           # One or more of the above class
)            # End of capture group 1
;%T=         # Matches the literal characters ";%T="
(            # Start of capture group 2
 [0-9.+\-e]  # Matches any of the characters "0123456789.+-e"
 +           # One or more of the above class
)            # End of capture group 2
;            # Matches the literal character ";"
```
There’s a fantastic website for trying out regex, with a dynamically generated explainer below the input box. This is an essential way to dry-run your regex strings and ensure that they make sense.
Regex allows you to be a lot more versatile than the simple `?`-delimited conventions mentioned earlier. In particular, it allows you to add in escape sequences if the need arises.
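For a taste of why escaping matters (my illustration, with a made-up filename): `?` and `.` are regex metacharacters, so a literal prefix should be escaped before being spliced into a larger pattern.

```python
import re

# "?" and "." are regex metacharacters, so escape the literal prefix
# before embedding it in a pattern
prefix = "run1?version=1.5"
pattern = re.escape(prefix) + r"\?phi=([0-9.+\-e]+)"

m = re.search(pattern, "run1?version=1.5?phi=0.56_wf.csv")
print(m.group(1))  # prints 0.56
```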
`vim`, VS Code, and most other text editors support using regex to search, which is an invaluable tool when you want to refactor a piece of code.
Hard Mode: Using a real database
The canonical file format for complicated, array-heavy data structures is HDF5.
The downside is that parallelisation becomes a headache: having multiple threads contribute to the same file means you need to worry about locks, mutexes and famously difficult-to-debug race conditions.
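For a flavour of the approach, here is a minimal `h5py` sketch (my illustration, not the workflow below): each sweep point becomes a group, with its parameters stored as attributes instead of filename annotations.

```python
# Store a parameter sweep in one HDF5 file, one group per sweep point
import h5py
import numpy as np

with h5py.File("sweep.h5", "a") as f:
    for phi in (i / 100 for i in range(101)):
        order = np.random.rand(100)        # stand-in for real results
        g = f.require_group(f"phi={phi:.2f}")
        g.attrs["phi"] = phi               # queryable metadata
        g.attrs["temperature"] = 100.0
        if "order" in g:
            del g["order"]                 # let re-runs overwrite cleanly
        g.create_dataset("order", data=order)
```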
My workflow
When running a simulation, I do the following:
- Create input files. Anything that will be fed into the simulation is marked as read-only with `chmod a-w /path/to/inputv1.toml`. This avoids any accidental mutation of the mapping from { infile names } -> { parameters } over time.
- Run the sweep using a bash `for` loop, or some other jig.
- After processing the data, iterate on the original input with a new file, using a new file name.
This routine guarantees that data from months (or even years) ago remains usable: I can be confident that I understand exactly what went into these simulations and what came out. Assuming that my git history is intact, I can even double-check the state of the simulation code at that point in time.