Research code: An inside look

I've heard many complaints that researchers don't release their source code. The argument usually goes like this: in order to make a technique usable, you should release the source code in such a way that everyone can quickly try the algorithm and ideally hack on it. Researchers, however, just don't care enough, and that hurts everyone. Yes, there are cases where the code should definitely be released -- in particular, shader code or a ray-tracing kernel -- but often it is not just the researcher's laziness that prevents the code from being released.

First of all, releasing code requires a license. Unfortunately, that is usually much more complicated than it seems. In many cases, some of the code comes from a third party, be it a co-researcher, a student who wrote some code as part of their thesis, or some large library. Finding all authors and asking for a license is sometimes impossible, as an author might have left the research group or the code could be really old. Even worse, what if some of the code is under the GPL? Releasing everything under the GPL makes it nearly impossible for a game developer to pick up the code, as you are in constant danger of violating the GPL. I know of several companies that simply prohibit their developers from looking at GPL code.

Another pet peeve of mine is code and compiler dependencies. When I talk to game developers, they always frown upon complex code, and God forbid you use Boost, a perfectly fine and easy-to-use set of C++ libraries. However, writing all your code as simple C is not the best way to spend your time as a researcher. In some cases, using the Boost filesystem library can save real time. The same applies to complex, template-only libraries like Eigen. Sure, it's not like that's the first solution I would try, but before writing my own Cholesky decomposition, I'm going to use the one from Eigen. At the same time, I'm well aware that this renders the code nearly unusable for a game developer. Even worse, some of the code might be written in a different language, making it impossible to reuse directly. In our assisted texture assignment work, for instance, half of the pre-processing pipeline was written in Python.
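To give a feel for the kind of numerical kernel such a library replaces, here is a minimal hand-rolled Cholesky decomposition in pure Python -- a toy sketch of my own, not taken from any particular research codebase:

```python
import math

def cholesky(a):
    """Return the lower-triangular L with a = L * L^T.

    `a` must be a symmetric positive-definite matrix, given as a list
    of row lists. No pivoting, no conditioning checks, no error
    handling -- exactly the corner-cutting a battle-tested library
    saves you from.
    """
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(a[i][i] - s)   # diagonal entry
            else:
                L[i][j] = (a[i][j] - s) / L[j][j]  # below-diagonal entry
    return L

# For A = [[4, 2], [2, 3]], the factor is [[2, 0], [1, sqrt(2)]].
L = cholesky([[4.0, 2.0], [2.0, 3.0]])
```

With Eigen, the same factorization is a single call to `A.llt()`, already tested against the edge cases this sketch ignores -- which is exactly why reaching for the library is usually the right trade-off for a researcher, even if it costs a dependency.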

As mentioned above, research code often has dependencies. These can be small -- an image loading library, for instance -- or really large, like a BLAS library. In some cases, the dependencies are not public. A research institute might have an in-house fluid solver, optimized math routines, or other code which cannot be easily open-sourced. Ripping out such dependencies is no simple task. If you think that's the worst case: dependencies might also come with NDAs, making it impossible to release the code without asking a lawyer. A typical example is hardware vendors who provide frameworks, drivers, and APIs for unreleased hardware under strict NDAs. Removing such code without revealing what the NDA'd API looks like is very tricky. Releasing it after the NDA expires is equally complicated, as the API might look slightly different or some semantics might have changed (or, in some cases, the library is simply never released).

Game developers typically assume that the code is written using the same toolkits they use, or at least a tool chain which is reasonably close. Too bad if a researcher uses Linux and the code doesn't even compile on Windows. Porting the code might be impossible, as it can depend on compiler-specific or platform-specific stuff. Even if the compiler is the same, machine-specific optimizations or dependencies can prevent the code from running on anything not resembling the researcher's workstation. Case in point: I know a few projects which require machines with insane amounts of memory to run (50 GiB or so, in the year 2011) because nobody bothered to fully resolve bugs in the memory caching system. Remember, researchers focus on getting results, and it's always faster to just use more RAM or CPU than to write and debug an optimization.
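To make the platform problem concrete, here is a toy sketch (the function and the library name are invented for illustration) of the kind of branching that creeps in the moment code touches anything native -- every branch the researcher didn't write is a platform the code silently doesn't support:

```python
import sys

def native_library_name(base, platform=None):
    """Map a library's base name to its platform-specific file name.

    Hypothetical helper: the point is that code written on Linux bakes
    in assumptions (here: shared-library naming conventions) that a
    Windows port has to find and undo one by one.
    """
    platform = platform or sys.platform
    if platform.startswith("linux"):
        return f"lib{base}.so"
    if platform == "darwin":
        return f"lib{base}.dylib"
    if platform in ("win32", "cygwin"):
        return f"{base}.dll"
    # The researcher's workstation never hits this branch.
    raise RuntimeError(f"unported platform: {platform}")

print(native_library_name("fluidsolver", "linux"))  # libfluidsolver.so
```

File naming is the trivial case; the same pattern repeats for compiler intrinsics, path handling, and build scripts, and each instance is another porting task nobody budgeted for.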

That's it for the hardware; the data is equally problematic. Game developers typically like to rant that researchers don't test on real-world data. I can tell you: getting your hands on real-world data can take weeks, and getting permission to actually publish work with it can take a few weeks more. The cases where someone was allowed to actually ship content from a game are few and far between. On this front, having a standard source for data sets would definitely help, but we're not there yet.

The points mentioned above are not the whole story; there's a lot of other stuff as well: mixed code from multiple (possibly not yet published) projects, code quality, hard-coded data, large frameworks that are difficult to build, dependencies that are crucial but exist only as binaries, and other problems. Believe me: every researcher would like to release their code in an easy-to-use package, but in most cases, doing so would require prohibitive amounts of time, a lawyer, or sometimes a time machine. If you really can't understand some paper, I would recommend trying to get in touch with the researcher first. In particular, if you have a hard time understanding some inner loop or other tricky part of the code, I would guess that every researcher would be happy to give you both the code for that part and a detailed explanation.
