Computational Science Cookbook III: Databases for the lazy
[ code , comp-sci ]

If you’ve read the yet-to-be-published Computational science cookbook II post, you’ll know that it’s very likely that multiple files will be involved in setting up, running and post-processing your simulations. The trouble is translation: how do you correlate the output with the input?

I wanted to go into depth about the kinds of rudimentary database management you can use for running physics simulations in a way that’s quick to set up and reliable.

lazyDB: database management through file names

There are famously two opposing paradigms in CS: functional and object-oriented. The former works only with pure functions, the latter encourages a complex stateful interplay of data-containing objects, and both have their respective strengths.

In some sense, the “functional” approach to naming output files is that every output’s filename contains all of the information needed to recreate it.

E.g. suppose you have a numerical Schrödinger equation solver; its output files might look like

run1?version=1.5?nsteps=200?potential=[9,4,2,1,2,4,9]?n=5?method=exactdiag_wf.csv

whereas the stateful-paradigm approach would have output files named like

run2?version=1.5?infile=input/parabolic_9_wf.csv

Keeping everything in the filename looks hacky, but at the end of the day it’s how the Web works: GET requests look like https://duckduckgo.com/?q=databases+for+scientific+computing&t=newext&atb=v298-1&ia=web, and they power almost everything you see.
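
To make this concrete, here is a minimal Python sketch of a helper that serialises a parameter dict into such a filename (the function name and layout are my own invention, not a library API):

def encode_filename(stem, params, suffix=".csv"):
    # Join the stem and each key=value pair with '?', mirroring a GET request
    parts = [stem] + [f"{key}={value}" for key, value in params.items()]
    return "?".join(parts) + suffix

print(encode_filename("run1", {"version": 1.5, "nsteps": 200, "n": 5}))
# run1?version=1.5?nsteps=200?n=5.csv

One caveat: ? is a glob character in most shells, so these names need quoting when you handle them interactively; any other separator that is inert in filenames (such as %, used later in this post) works just as well.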

| Pros | Cons |
| --- | --- |
| Completely obvious what is happening | Long filenames are limiting (a single filename is capped at 255 bytes on most UNIX filesystems) |
| Remembers the state of the input file when it is read | Difficult to see what is happening at a glance |
| No risk of overwriting data with different parameters | |
| Easy to label graphs when parsing output | |
| Robust to any modifications to input files | |

The stateful-paradigm, input-file approach has the opposite situation:

| Pros | Cons |
| --- | --- |
| Arbitrarily large, complex input structures can be used | Input files must be made read-only to avoid later confusion |
| If well named, output file names contain only relevant information | |

I want to stress that storing parameters in the filenames of the output you generate is good practice, even if it looks ugly or feels hacky. It only becomes a problem if you radically change the meaning of a parameter later, which is easily avoided by giving the new parameter a different name, or better yet by storing a “version” field in the filename.

Generally speaking, I find myself dealing with around 10 numerical parameters (which shouldn’t affect the answer much) and 4-5 physical parameters. When doing many trials (as in a Monte Carlo simulation) or many similar computations (as in a parameter sweep), almost none of these change: it seems wasteful to clutter every output filename with a long list of parameters that are 90% identical across the folder.

I therefore suggest a compromise: just do both!

Specifically, my basic_parser class implements two interface functions, from_file and from_argv. Simulation files then start with a verbose but transparent preamble:


// Optional parameter variables
unsigned
    nreps;
double
    phi,
    T;

// This string becomes the header row of any CSV outputs
param_t p("My program v" + VERSION + currentDateTime());

// Bind all of the globals
p.declare("nreps", &nreps);
p.declare("phi", &phi);
p.declare("temperature", &T);

// Read the params from file
p.from_file(argv[1]);

// The output directory is the second positional argument
std::filesystem::path output_path(argv[2]);

// Extract the input file's root and use it as the base for the output file
std::filesystem::path in(argv[1]);
// Remove the file extension
std::string prefix(in.filename().replace_extension());
// Append the command line arguments from argv[3] onwards as filename annotations
prefix += p.from_argv(argc, argv, 3);

// Ensure that we did not miss anything
p.assert_initialised();

// ... calculations ...
std::ofstream ofs(output_path / (prefix + "_output1.csv"));

This simulation has three input parameters, nreps, phi and temperature; suppose we want to sweep phi at fixed temperature.

Have the simulation program read the file high_temp_template.toml, consisting of

# measured in units of J
temperature = 100
# number of iteration steps
nreps = 10000
# the coupling constant
# phi = 0

Note that the phi specification is commented out, so it must be passed from the command line:

bin/sim input/high_temp_template.toml output_folder/ --phi=0.56

This produces output files (relative to the directory the above command was run in) looking like

output_folder/high_temp_template?v=1.5?phi=0.56.quantity1.csv
output_folder/high_temp_template?v=1.5?phi=0.56.quantity2.csv

The v flag, short for ‘version’, is hard-coded into the executable, and in general you might want to write several output files for the different quantities of interest. It is also advisable to add a timestamp to the header row of your CSV files as a sanity check.

Running a sweep is simple using a bash for loop:

for phi in `seq 0 0.01 1`; do
    bin/sim input/high_temp_template.toml output_folder/ --phi=$phi
done

This kind of problem is known as an “embarrassingly parallel” calculation: there are a huge number of jobs that can all run completely independently of one another, avoiding the need for complicated OpenMP / MPI structures. Queuing all of these jobs in parallel gets slightly trickier if you want to avoid spawning thousands of threads at once, but that’s a topic for another post.

RegEx

Regular expressions, or regex, are an absolutely indispensable tool for parsing human-readable files. They are a kind of shorthand for grabbing parts of strings, which is exactly what you need when important simulation parameters are stored in filenames. For example, when generating a phase diagram you might have a folder full of output files similar to

run_2022-05-05%phi=0.2000;%T=1.00e-5;_order.csv
run_2022-05-05%phi=0.2000;%T=1.00e-3;_order.csv
run_2022-05-05%phi=0.2000;%T=1.00e-1;_order.csv
run_2022-05-05%phi=0.4000;%T=1.00e-5;_order.csv
run_2022-05-05%phi=0.4000;%T=1.00e-3;_order.csv
run_2022-05-05%phi=0.4000;%T=1.00e-1;_order.csv
...

This naming convention is inspired by the web, which uses &parameter=value to pass arguments to the server without navigating to a new page. Now say you want to accumulate all of these in a Python script: it’s a pain to do with string slicing and dicing, but regex has a solution:

import re # the regex module
import sys
from glob import glob
import numpy as np

# Feed the stem in from the command line (shell autocomplete avoids needing to remember it properly)
files = glob(sys.argv[1] + '*_order.csv')
print("Averaging over\n")
print(files)
for file in files:
    res = re.search(r"%phi=([0-9.+\-e]+);%T=([0-9.+\-e]+);", file)
    phi = float(res.group(1))
    T = float(res.group(2))
    print(f"phi = {phi}, T = {T}")
    data = np.genfromtxt(file)
    # ... averaging and aggregating
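
The “averaging and aggregating” step depends on what the CSVs contain, but as a sketch (assuming each file holds a single column of order-parameter samples and that every (phi, T) combination appears exactly once), it might look like:

import re
import sys
from glob import glob
import numpy as np

# Accumulate the mean of each file's samples, keyed on (phi, T)
means = {}
for file in glob(sys.argv[1] + '*_order.csv'):
    res = re.search(r"%phi=([0-9.+\-e]+);%T=([0-9.+\-e]+);", file)
    phi, T = float(res.group(1)), float(res.group(2))
    means[(phi, T)] = np.mean(np.genfromtxt(file))

# Arrange the means on a (phi, T) grid, ready for e.g. plt.pcolormesh
phis = sorted({p for p, _ in means})
temps = sorted({t for _, t in means})
grid = np.array([[means[(p, t)] for t in temps] for p in phis])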

Admittedly, regex looks like random noise the first time you look at it. Let’s break it down:

%phi=([0-9.+\-e]+);%T=([0-9.+\-e]+);
%phi=                                  # Matches the exact characters "%phi="
     (                                 # Start of capture group 1
      [0-9.+\-e]                       # Matches any of the characters "0123456789.+-e"
                +                      # One or more of the above class
                 )                     # End of capture group 1
                  ;%T=                 # Matches the literal characters ";%T="
                      (                # Start of capture group 2
                       [0-9.+\-e]      # Matches any of the characters "0123456789.+-e"
                                 +     # One or more of the above class
                                  )    # End of capture group 2
                                   ;   # Matches the literal character ";"

There’s a fantastic website for trying out regex, with a dynamically generated explainer below the input box. This is an essential way to dry-run your regex strings and make sure they do what you think. Regex also lets you be far more versatile than the simple ?-delimited convention mentioned earlier; in particular, it can cope with escape sequences if the need arises.

vim, VS Code, and most other text editors support using regex to search, which is an invaluable tool when you want to refactor a piece of code.

Hard Mode: Using a real database

The canonical file format for complicated, array-heavy data structures is HDF5.
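
For example, with the h5py bindings the parameters can live as attributes right next to the data rather than in the filename (the group layout and attribute names below are illustrative only, not a standard):

import h5py
import numpy as np

# Sketch: one group per run, with the run's parameters attached as attributes
with h5py.File("results.h5", "a") as f:
    run = f.create_group("run_0042")
    run.attrs["phi"] = 0.56
    run.attrs["temperature"] = 100.0
    run.attrs["version"] = "1.5"
    run.create_dataset("order_parameter", data=np.random.rand(10000))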

The downside of doing this is that parallelisation becomes a headache: having multiple processes contribute to the same file means you suddenly need to worry about locks, mutexes and famously difficult-to-debug race conditions.
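
One common way to sidestep this, sketched below under the assumption that each job writes its own worker_<n>.h5 file, is to merge everything in a single serial pass afterwards, so that no two processes ever open the same file for writing:

import h5py
from glob import glob

# Each sweep job writes its own worker_<n>.h5; one serial pass then copies
# every group into a combined file.
with h5py.File("combined.h5", "w") as out:
    for fname in glob("worker_*.h5"):
        with h5py.File(fname, "r") as part:
            for name in part:
                part.copy(name, out)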

My workflow

When running a simulation, I do the following:

  1. Create input files. Anything that will be fed into the simulation is marked as read-only with chmod a-w /path/to/inputv1.toml. This avoids any accidental mutation of the mapping { infile names } -> { parameters } over time.
  2. Run the sweep using a bash for loop, or some other jig (see the sketch after this list).
  3. After processing the data, iterate on the original input by creating a new input file, with a new file name.
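
For steps 1 and 2, a hedged Python equivalent of the chmod plus bash loop above might look like this (the paths are invented for illustration):

import os
import stat
import subprocess

# Step 1: strip all write permission bits, equivalent to chmod a-w
infile = "input/high_temp_template.toml"
mode = os.stat(infile).st_mode
os.chmod(infile, mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))

# Step 2: run the sweep, one job at a time
for i in range(101):
    subprocess.run(["bin/sim", infile, "output_folder/", f"--phi={i / 100}"],
                   check=True)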

This routine guarantees that data from months (or even years) ago remains usable: I can be confident that I understand exactly what went into these simulations and what came out. Assuming my git history is intact, I can even double-check the state of the simulation code at that point in time.