Managing results, or: What the hell did I do?
If you’re in research, one problem you’ll face in nearly every project is managing your results. Most of the time, it’s just a bunch of performance numbers, sometimes it’s a larger set of measurements, and sometimes you have hundreds of thousands of data points you need to analyse. In this post, I’ll explain how I do it. Without further ado, let’s get started!
Measuring & storing: Big data for beginners
Measuring is often considered “not a problem” and done in the simplest manner possible. Often enough, I see people taking screenshots manually or copying numbers from the console output. Those who do produce some structured output typically generate some kind of CSV and end up with hundreds of text files scattered across their file system.
Let me say this right away: Manual measurements are no measurements. Simply not worth the trouble. The chance that you’ll make a copy-and-paste error rapidly approaches 1 as the deadline gets closer. So it has to be automated somehow, but how?
Let’s take a look at the requirements first:
- Data recording should be “crash”-safe. It’s very likely that your machine will crash while recording the benchmarks, or that someone will cut your power, the disk will fail, or your OS update will chime in and kill your program. You want every result obtained so far to be safely recorded, with no partially written results.
- Data should be stored in a structured way: Results which belong together should be stored together. Ideally, the results should be self-describing so you can reconstruct them at some point later in time without having to look at your source code.
- Data must be easy to insert & retrieve: It should be easy to find a particular datum again. That is, if you want all tests which used a particular combination of settings, you should be able to get them without a lot of manual filtering.
- Test runs must be repeatable. You need to store the settings used to run the test.
Sounds like a complex problem? Yes, it is, but fortunately, there is one kind of software which solves exactly the problems outlined above: Databases. My go-to database these days is MongoDB, which is a document-oriented database. Basically, you store JavaScript objects, and MongoDB allows you to run queries over the fields of these objects.
When I say JavaScript objects, I really mean JSON. MongoDB is a JSON storage engine, using a highly efficient binary representation (BSON). The real power comes when you fully embrace JSON throughout your tools as well, in particular if you store all settings you need to run a test as JSON and if your tools return JSON. This allows you to store the test input and output together, making it trivial to recover after a crash: if the test settings are already present in the database, you simply skip them. In my case, I typically store a document with a ‘settings’ field and a ‘results’ field.
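To make this concrete, here is a sketch of what such a document could look like. Everything inside ‘settings’ and ‘results’ is hypothetical and will of course depend on your tools.

```python
# A sketch of a single result document; all field names inside 'settings'
# and 'results' are made up for illustration.
document = {
    "settings": {
        "scene": "sponza",              # hypothetical benchmark input
        "resolution": [1920, 1080],
        "iterations": 64,
    },
    "results": {
        "frame_time_ms": [16.4, 16.1, 16.7],
        "peak_memory_mb": 812,
    },
}
```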
Data insertion and retrieval is very simple. My test runners are all written in Python, and with PyMongo, you can directly take the JSON generated by your tools and insert it into the database. Processing JSON is trivial with Python, as a JSON object maps directly onto a Python object (it’s just nested lists and dictionaries), and with PyMongo, you can basically view your database as one large Python dictionary.
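A minimal sketch of the insertion and retrieval side with PyMongo; the database and collection names here are placeholders:

```python
from pymongo import MongoClient

# Connect to a local MongoDB instance; "benchmarks" and "runs" are
# placeholder names.
client = MongoClient("localhost", 27017)
runs = client["benchmarks"]["runs"]

runs.insert_one(document)   # 'document' as sketched above

# Find all runs that used a particular setting, using MongoDB's dot
# notation for nested fields.
for run in runs.find({"settings.iterations": 64}):
    print(run["results"]["peak_memory_mb"])
```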
A huge advantage of using a NoSQL database like MongoDB over a SQL database is that you don’t have to think about your data layout at all. With SQL, you have to put some thought into the table structure and you’ll need different tables for different kinds of results. While it’s not a huge problem, it still requires set-up time which you can avoid by using a document store. Performance-wise, I’d say just don’t worry: every database these days scales to huge data sets with millions of documents and fields with ease.
There’s only one case where using a separate database might be problematic, and that’s if you have to store gigabytes of data at very high speed and you can’t afford the interprocess communication. In this case, using an embedded database like SQLite might be necessary. If you can, you should also try to run the database on a separate machine for extra performance and robustness.
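If you do end up in that corner, a rough sketch of the embedded alternative could look like this, simply storing the settings and results as JSON text in SQLite (the table layout is my own invention, not anything prescribed):

```python
import json
import sqlite3

# Embedded alternative: one table with the settings and results stored as
# JSON text. Table and column names are placeholders.
con = sqlite3.connect("results.db")
con.execute("CREATE TABLE IF NOT EXISTS runs (settings TEXT, results TEXT)")

with con:  # the transaction gives you the crash-safety discussed above
    con.execute(
        "INSERT INTO runs VALUES (?, ?)",
        (json.dumps(document["settings"]), json.dumps(document["results"])),
    )
```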
For me, the complete process looks as follows:
- A Python runner generates a test configuration.
- The Python runner checks if the test configuration is already present in the database. If not, execution continues.
- Depending on the tool, the test configuration is passed as JSON or as command-line options to the benchmark tool.
- The benchmark tool writes a JSON file with the profiling results.
- The Python runner grabs the result file, combines it with the configuration and stores it to the database.
This process allows me to benchmark even super-unstable tools which crash on every run. As long as you spawn a new process for each individual test configuration, no invalid results can ever be generated.
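Here is a minimal sketch of that loop. The benchmark binary, its command-line interface, the output file name, and the configuration generator are all hypothetical stand-ins for whatever your tools look like:

```python
import json
import subprocess

for settings in generate_configurations():      # hypothetical generator
    if runs.find_one({"settings": settings}):
        continue                                # already measured, skip

    # One fresh process per configuration, so a crash only loses this run.
    status = subprocess.call(["./benchmark", "--config", json.dumps(settings)])
    if status != 0:
        continue                                # tool crashed, nothing to store

    with open("profile.json") as f:             # file written by the tool
        results = json.load(f)

    runs.insert_one({"settings": settings, "results": results})
```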
So much for data generation. Let’s take a look at the second part, analysis!
Analysis
For analysis, I’m using a great, Python-based combo: The new statistics module from Python 3.4, Matplotlib and NumPy.
The statistics module provides the basic functions like median and mean computation. For anything fancier, you’ll want to use NumPy. NumPy allows you to efficiently process even large data sets, and provides you with every statistics function you’ll ever need.
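As a small sketch, here’s how pulling one metric out of the database and summarizing it might look (the field names are the placeholders from the earlier sketches):

```python
import statistics
import numpy as np

# Collect one metric across all runs; 'frame_time_ms' is a placeholder.
samples = [run["results"]["frame_time_ms"] for run in runs.find()]
flat = np.concatenate(samples)

print("median:", statistics.median(flat))        # statistics for the basics
print("mean:", flat.mean())                      # NumPy for everything else
print("95th percentile:", np.percentile(flat, 95))
```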
Finally, you’ll want all plotting to be fully automated. Matplotlib works great for this – if you want a more graphical approach, you could also try Veusz. Matplotlib can automatically generate PDF documents which can be directly integrated into your publication.
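A sketch of that plotting step, writing straight to a PDF the publication can include; the metric is again the placeholder from above:

```python
import matplotlib
matplotlib.use("Agg")                  # no display needed on a benchmark box
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.hist(flat, bins=50)                 # 'flat' from the analysis sketch above
ax.set_xlabel("Frame time (ms)")       # placeholder metric
ax.set_ylabel("Samples")
fig.savefig("frame_times.pdf")         # drop the PDF next to your .tex sources
```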
Having everything automated at this stage is something you should really try to achieve. Chances are, you’ll fix some bug in your code close to the deadline, and at that point, you have to re-run all affected tests and re-generate the result figures. If you have everything automated, great, but if you do it manually, you’re extremely likely to make a mistake at this stage. Oh, and while automating, you should also generate result tables automatically and reference them from your publication – again, copy & paste is your worst enemy!
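For the tables, a sketch of the same idea: write a small .tex fragment straight from the database and \input it from the paper, so no number is ever copied by hand. The sort key and fields are again the placeholders from above:

```python
# Write one LaTeX table row per run; the paper \input's this file.
with open("results_table.tex", "w") as f:
    for run in runs.find().sort("settings.iterations"):
        f.write("{} & {:.2f} \\\\\n".format(
            run["settings"]["iterations"],
            statistics.mean(run["results"]["frame_time_ms"]),
        ))
```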
One note about the analysis: Installing all the packages can be a bit tricky on Windows, as there is only an unofficial 64-bit NumPy installer. You can spare yourself some trouble by moving the analysis over to Linux, where you can get everything in seconds using pip. Moving the database is a matter of seconds using mongodump and mongorestore. Even better, just let the database run on a second machine and you can analyse while your benchmark is running. If you use SQLite, you can also just copy the files over.
That’s it; I hope I could give you a good impression of how I deal with results. I’ve been using the approach outlined above for a few years now to great effect. Now, every time it comes to measuring stuff, instead of “oh gosh, I hate Excel” you can hear me saying “hey, that’s easy” :)