Research code: An inside look

I've heard many complaints that researchers don't release their source code. The argument usually goes like this: In order to make a technique usable, you should release the source code in such a way that everyone can quickly try the algorithm and ideally hack on it. Researchers, however, just don't care enough, and that hurts everyone. Yes, there are some cases where the code should definitely be released -- in particular, shader code or a ray-tracing kernel -- but in some cases, it is not just the researcher's laziness that prevents the code from being released.

First of all, releasing code requires a license. Unfortunately, that is usually much more complicated than it seems. In many cases, some of the code comes from a third party, be it a co-researcher, a student who wrote some code as part of a thesis, or some large library. Finding all authors and asking for a license is sometimes impossible, as the author might have left the research group or the code could be really old. Even worse, what if some of the code is under the GPL? Releasing everything under the GPL makes it nearly impossible for a game developer to pick up the code, as you are in constant danger of violating the GPL. I know of several companies which simply prohibit looking at GPL code.

Another pet peeve of mine is code and compiler dependencies. When I talk to game developers, they always frown upon complex code, and God forbid you use Boost, a pretty fine and easy-to-use C++ library. However, writing all your code as simple C is not the best way to spend your time as a researcher. In some cases, using the Boost filesystem library can save some time. The same applies to complex, template-only libraries like Eigen. Sure, that's not the first solution I would try, but before writing my own Cholesky decomposition, I'm going to use the one from Eigen. At the same time, I'm well aware that this renders the code nearly unusable for a game developer. Even worse, some of the code might be written in a different language, making it impossible to reuse directly. In our assisted texture assignment work, for instance, half of the pre-processing pipeline was written in Python.

As mentioned above, research code often has dependencies. These can be small, for instance an image loading library, or really large, like a BLAS library. In some cases, the dependencies are not public. A research institute might have an in-house fluid solver, optimized math routines, or other code which cannot be easily open-sourced. Ripping out such dependencies is no simple task. If you think that this is the worst case: Dependencies might also come with NDAs, making it impossible to release the code without asking a lawyer. A typical example is hardware vendors who provide frameworks, drivers, and APIs for unreleased hardware under strict NDAs. Removing such code without revealing what the API under NDA looks like is very tricky. Releasing it after the NDA expires is equally complicated, as the API might look slightly different or some semantics might have changed (or, in some cases, the library is simply never released).

Game developers typically assume that the code is written using the same toolkits they use, or at least a tool chain which is reasonably close. Too bad if a researcher uses Linux and the code doesn't even compile on Windows. Porting the code might be impossible, as it can depend on compiler-specific or platform-specific behaviour. Even if the compiler is the same, machine-specific optimizations or dependencies can prevent the code from running on anything not resembling the researcher's workstation. Case in point: I know a few projects which require machines with insane amounts of memory to run (50 GiB or so, in the year 2011) because nobody bothered to fully resolve bugs in the memory caching system. Remember, researchers focus on getting results, and it's always faster to just use more RAM or CPU than to write and debug an optimization.

That's it for the hardware; the data is equally problematic. Game developers typically like to rant that researchers don't test on real-world data. I can tell you: Getting your hands on real-world data can take weeks, and getting permission to actually publish work with it a few weeks more. The cases where someone was allowed to actually ship content from a game are few and far between. On this front, having a standard source for data sets would definitely help, but we're not there yet.

The points mentioned above are not the whole story; there's a lot of other stuff as well, like mixed code from multiple (possibly not yet published) projects, code quality, hard-coded data, large frameworks that are difficult to build, dependencies that are crucial but exist as binary only, and other problems. Believe me: Every researcher would like to release his code in an easy-to-use package, but in most cases, doing so would consume prohibitive amounts of time, require a lawyer, and sometimes a time machine. If you really can't understand some paper, I would recommend trying to get in touch with the researcher first. In particular, if you have a hard time understanding some inner loop or other tricky part of the code, I would guess that every researcher would be happy to give you both the code for that part and a detailed explanation as well.

Be careful with that shared_ptr, my friend

If you are developing in C++, chances are high that you are using shared_ptr. At least on this blog, my article on shared_ptr is by far the one with the most hits. However, if you are using shared_ptr nearly everywhere, you are doing it wrong.

Over the years, my attitude towards shared pointers has shifted. In the beginning, I was all in on shared pointers, wrapping nearly every call to new. While this solved all memory leaks once and for all, there were also a few subtle cases where one object would be kept alive while some dependency could die, and other minor lifetime issues. Over time, these small issues became more and more of a problem, and as it turns out, they are intimately tied to shared pointers. The question you should always ask yourself is: Are the object ownerships clearly expressed?

If your answer is no, chances are high that some object is kept floating around by a shared pointer while one of its dependencies has already been destroyed, leaving you with unusable yet existing objects. A simple example: If you have a class which represents a file system and returns file handles by shared pointer, what happens when the file system is destroyed? If you have given out shared pointers, you are doomed right there, as the clients will assume that their pointers are valid while in fact they are pointing to zombies. Recently, I did a large sweep over my codebase to remove shared pointers in such cases and replace them with handles (or, in some cases, raw pointers) -- which expresses much more clearly what is going on, making the code safer. You can still allow users to wrap a handle in a shared pointer with a custom deleter, but it's explicit now and no longer implicit.
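
To make this concrete, here is a minimal sketch of the handle approach. FileSystem, FileHandle and their members are made-up names for illustration, not taken from any particular library, and error handling is omitted:

#include <cstdint>
#include <cstdio>
#include <unordered_map>

struct FileHandle
{
    std::uint32_t id;
};

class FileSystem
{
public:
    FileSystem () : nextId_ (1) {}

    FileHandle Open (const char* path)
    {
        FileHandle handle = { nextId_++ };
        files_ [handle.id] = std::fopen (path, "rb");
        return handle;
    }

    void Close (FileHandle handle)
    {
        auto it = files_.find (handle.id);
        if (it != files_.end ()) {
            std::fclose (it->second);
            files_.erase (it);
        }
    }

    // The file system remains the sole owner; a stale handle yields a null
    // pointer instead of a valid-looking pointer to a zombie object
    std::FILE* Get (FileHandle handle) const
    {
        auto it = files_.find (handle.id);
        return (it != files_.end ()) ? it->second : nullptr;
    }

private:
    std::unordered_map<std::uint32_t, std::FILE*> files_;
    std::uint32_t nextId_;
};

int main ()
{
    FileSystem fs;
    FileHandle file = fs.Open ("test.bin");
    // ... use fs.Get (file) for as long as the file system is alive ...
    fs.Close (file);
}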

Sure, you can solve such problems with weak and shared pointers, by referencing the parent from the children, and with even more code, but does it really make things easier? The simpler the code, the better, and with shared pointers, it's really easy to dig yourself into a hole from which it becomes increasingly complicated to get out. A simple handle makes ownership crystal clear and can simplify the code a lot.

The second problem arises once you return objects as shared pointers. While it doesn't sound like a big deal at first, in practice it is. In most cases, the clients will only need an object for a short time. For instance, you might pass a file handle directly into an image loading function and destroy it afterwards. In this case, using a shared pointer is simply a waste: It requires an extra allocation and atomic operations, and it generates more code. The core problem here is that returning a shared pointer is a very strong statement, indicating that clients will always want to reference-count the object. If that is not the case, you still force that overhead onto your clients.

Fortunately, there is a much better alternative: In C++11 (and already supported by Clang, MSVC10, and GCC 4.6), you can return a unique pointer instead. It transparently converts into a shared pointer if necessary, but it's a much more lightweight object and does not force the clients into any particular lifetime management solution. During the refactoring mentioned above, I also replaced pretty much every single function returning a shared pointer with one returning a unique pointer. It turned out that at most call-sites, there was indeed no need for a shared pointer. As a side effect, the binary size was also reduced.
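
As a sketch of what this looks like -- Image and CreateImage are placeholder names, not a real API -- returning a unique pointer leaves the lifetime decision to the caller:

#include <memory>

struct Image
{
    int width;
    int height;
};

std::unique_ptr<Image> CreateImage (int width, int height)
{
    std::unique_ptr<Image> result (new Image);
    result->width = width;
    result->height = height;
    return result;
}

int main ()
{
    // Most call-sites only need unique ownership, at zero extra cost
    std::unique_ptr<Image> image = CreateImage (640, 480);

    // Callers which really want reference counting can still opt in,
    // as unique_ptr converts implicitly into shared_ptr
    std::shared_ptr<Image> shared = CreateImage (640, 480);
}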

There are surely many places where shared pointers are the best solution, and I definitely would not recommend going back to manual memory management. However, shared pointers are not the only solution, and you should carefully consider whether the resulting code is indeed simpler and easier to reason about. If not, consider handles or other, more explicit means of expressing ownership.

What is OpenCL?

This is a short, basic introduction to OpenCL, targeted at customers who are curious to understand how their software works and at developers who are not yet familiar with massively parallel programming.

As a consumer, you might wonder why your new mobile phone comes with a quad-core processor and which applications can take advantage of it. Similarly, if you have a notebook, you probably have multiple cores right now, yet some applications like a text processor don't run any faster while others like image processing benefit a lot. How come? As a developer, you might have reached the point where you try to rewrite parts of your application to benefit from multi-threading, and wondered why this is so complicated using the OS interfaces.

Due to NVIDIA's excellent marketing, you've probably already heard about CUDA. On notebooks, there's often a greenish "CUDA enabled" sticker. But what does it actually mean? And how does CUDA fit into the big picture?

The core problem in the hardware space right now is power usage; battery life is very important in mobile devices, just as efficiency is in desktop and notebook PCs. What happened is that it's no longer possible to make a single program run faster -- on the other hand, multi-core CPUs can run multiple programs at the same time. Each of them might run just as fast as it did a few years ago, but by running more of them, the overall throughput increases. That's the reason we're seeing more and more cores, even in mobile phones. Graphics cards are also a type of processor, with lots and lots of processing cores.

The big question with all these cores is how to make efficient use of them. What CUDA brought to the table was a programming model, inspired by the graphics APIs, which we now consider the best approach for highly parallel programs. This programming model brings strong constraints -- for instance, communication between elements is limited and memory accesses are more complicated -- but it allows certain problems to be solved efficiently. For instance, a lot of image processing tasks like blurring or adjusting colours map very well to this programming model. However, if an application is designed for CUDA, it is also limited to NVIDIA's GPUs. This may be fine, but sometimes you don't have a GPU, sometimes the memory on the GPU is not enough, and sometimes AMD's GPUs might simply be faster at a given problem.

Enter OpenCL: OpenCL is a standardized formulation of this parallel programming model, with similar constraints to CUDA, but with much wider hardware support. From mobile phones to graphics cards and CPUs, OpenCL provides a unified interface for software developers. For you as a customer, this means you have to care less about the particular device at hand. Your image processing suite will work just fine on your smartphone and on your notebook, and if you move it to your desktop PC, you will get better performance, but in every case, the software will use the hardware efficiently. With CUDA, what might happen is that on a notebook without an NVIDIA card, a tool will only use one CPU core and burn a lot of power. With OpenCL, chances are that it will use all CPU cores and the integrated graphics chip as well, resulting in better performance and lower energy use.

For you as a customer, OpenCL is yet another technique which makes your software run faster and improves battery life/power efficiency. It also makes it easier for you to compare and choose hardware which works best for your problem, as you get more choice. Finally, OpenCL is also heading to the web: In the future, we can expect image processing tools which are running completely in the browser. These tools are highly likely to take advantage of OpenCL.

For you as a developer, OpenCL provides an API to target a lot of massively parallel hardware platforms with the same code. This means less duplication, easier development and easier deployment. If you haven't given it a try yet, you should at least take a look now. Parallel programming is here to stay, and OpenCL provides the most gentle introduction to it.
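
If you are curious what such code looks like, here is a minimal sketch of an OpenCL kernel, written in OpenCL C, which brightens an image; the kernel name and the one-work-item-per-pixel mapping are illustrative:

// Each work item processes one pixel; the OpenCL runtime spreads the
// work items across whatever cores the device provides
__kernel void Brighten (__global const float* input,
                        __global float* output,
                        const float factor)
{
    const int i = get_global_id (0);
    output [i] = input [i] * factor;
}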

Why you should use LLVM's ArrayRef

If you are developing in C++, one major source of errors is out-of-bounds accesses when passing around pointers. Even though these pointers come from locations where the buffer size is typically known, errors occur due to the following problem: When you pass around a plain pointer, all information about the size of the buffer is lost at the usage location. That makes it impossible to verify that the buffer has the correct size when writing through it. Luckily, the solution for this is really simple: LLVM's ArrayRef.

An array reference is a simple structure which bundles a pointer together with the size of the memory pointed to. The simplest implementation is:

template <typename T>
struct ArrayRef
{
    const T* data; // pointer to the first element
    size_t size;   // number of elements
};

Keeping the size along with the pointer immediately removes a large class of errors. This brings C++ closer to C# and Java, where all arrays know their bounds, without forcing clients to use a particular container class.

As a side bonus, it also decouples functions from the actual data representation. Instead of forcing clients to pass a standard container or a pair of iterators, array references handle all those cases transparently. This also works for output buffers: a mutable array reference clearly expresses that a pointer is an output pointer, which removes another possible source of confusion.
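
As a sketch of what this looks like at the call-site -- the converting constructors here are simplified stand-ins for the ones LLVM's real ArrayRef provides:

#include <cstddef>
#include <vector>

template <typename T>
struct ArrayRef
{
    const T* data;
    size_t size;

    ArrayRef (const T* d, size_t s) : data (d), size (s) {}
    ArrayRef (const std::vector<T>& v) : data (v.data ()), size (v.size ()) {}
};

// The function neither knows nor cares how the buffer is stored
float Sum (ArrayRef<float> values)
{
    float result = 0;
    for (size_t i = 0; i < values.size; ++i) {
        result += values.data [i];
    }
    return result;
}

int main ()
{
    std::vector<float> vec (16, 1.0f);
    float raw [4] = { 1, 2, 3, 4 };

    Sum (vec);                        // standard container, implicit conversion
    Sum (ArrayRef<float> (raw, 4));   // raw pointer plus size
}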

Over the last few months, I have started to replace raw pointers to buffers in my framework with array references and validation. It didn't take long before I found the first bunch of bugs where a buffer was too small or too large. It might sound a bit counter-intuitive at first to check whether an output buffer is too large, but this turned out to be a useful check. Most of the time, the caller simply overestimates the required memory, and then it pays off to provide an accessor or utility function that returns the exact size, reducing temporary storage requirements. Otherwise, the caller is probably creating a slice from a larger memory buffer, and in that case, it's usually just as simple to compute the slice size precisely. The only case I found where I merely check that the buffer is large enough is compression algorithms.
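
A hedged sketch of that exact-size validation; MutableArrayRef mirrors the read-only structure from above, and FillGradient is a made-up example function:

#include <cassert>
#include <cstddef>

template <typename T>
struct MutableArrayRef
{
    T* data;
    size_t size;
};

void FillGradient (MutableArrayRef<float> output, size_t expectedCount)
{
    // Checking for equality catches buffers which are too small and too
    // large; only compression-style functions relax this to >=
    assert (output.size == expectedCount);

    for (size_t i = 0; i < output.size; ++i) {
        output.data [i] = static_cast<float> (i) / static_cast<float> (expectedCount);
    }
}

int main ()
{
    float buffer [8];
    MutableArrayRef<float> output = { buffer, 8 };
    FillGradient (output, 8);
}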

If you haven't seen or used array references, it's time to give them a try and practice safer C++.

Symstore for easier PDB handling

If you're developing on Windows, you are probably familiar with .PDB files. Those files provide debug information for binaries, making them invaluable when debugging crash dumps and other issues. Unfortunately, raw PDBs have some problems:

  • They are built for one binary only: Rebuilding your project from source control will yield a different binary and hence a different PDB file, so you really need to archive the matching PDBs for each released binary.
  • They are usually really large: We're talking hundreds of MiB for a release, and you need to keep around many of them.
  • Matching a binary to the PDB requires you to set up the path to the PDB. This can become cumbersome if you have released 10 different versions.

Fortunately, there is a solution which solves all of these problems at the same time: Enter symstore, a tool for creating a symbol store.

Symstore is a Windows SDK tool which creates a symbol store. You can find it, for instance, in the Windows 8 SDK under the debuggers directory. A symbol store is just a folder with a special layout and some additional information, which can store multiple PDB versions in such a way that Visual Studio can pick them up easily. You just point Visual Studio to the symbol store folder (which can also be on a network drive), and if the PDB for the binary you want to debug is present in the store, it just works. This works for multiple versions of the same binary as well.

All you need to do is call the symstore binary and add the PDBs to the symbol store on each release. It doesn't matter from which machine this is done, as all required information is stored in the symbol store. In my case, the deployment script takes care of this while preparing a release. And as a final bonus, file size is no longer a problem either, as the symbol store comes with a compression option which typically reduces the PDB size by 80%. That's great when the network storage location doesn't support file compression.
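
For reference, a typical invocation looks roughly like this; the paths, product name, and version string are placeholders for your own setup:

symstore add /f build\Release\*.pdb /s \\fileserver\symbols /t "MyProduct" /v "1.2.0" /compress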