Running OpenCL Cycles using Blender 2.71 & AMD GPUs

If you are using Blender with an AMD GPU, you have probably noticed that Cycles doesn't officially support AMD GPUs. However, for some scenes, Cycles actually works just fine and might be worth a try. Here's a quick how-to guide.

First of all, if you are on Linux, make sure you have the AMD proprietary driver installed. You can grab it from the AMD homepage. Next, you need to set an environment variable in order to make Cycles "see" the AMD OpenCL device. On Windows, you can open a command window (Shift-Right-Click on your Blender installation folder and use "Open command prompt here..."), then type in:

set CYCLES_OPENCL_TEST=all
.\blender.exe

On Linux, you can use:

CYCLES_OPENCL_TEST=all ./blender

After Blender has started, you now have to pick your compute device. Under "User preferences", "System", you can find the "Compute device" option. For AMD, pick your GPU code name as the device. The code name for the R9 290 and R9 290X is "Hawaii"; for the R9 280, R9 280X, HD 7950 and HD 7970 it is "Tahiti". Other code names are "Pitcairn" and "Bonaire"; these are the GCN-based cards -- I doubt Cycles will run correctly on older cards.

[Screenshots: selecting the GPU compute device in the user preferences]

Anyway, once set up, if everything works right, you'll get GPU accelerated Cycles!

[Screenshots: GPU-accelerated Cycles running on the AMD card, including the BMW benchmark scene]

However, as of August 2014, using the Catalyst 14.6 drivers and Blender 2.71, not everything works correctly. In one scene, I'm having issues with incorrect intersections which only occur on the GPU device:

[Images: correct render vs. incorrect intersections on the GPU device]

Interestingly, I have also written my own tiny OpenCL raytracer where I ran into similar issues. What happened was that the compiler derived incorrect value ranges from a condition and later miscompiled some code. In my case, I had code like this:

if (cullBackface) {
    if (U<0.0f || V<0.0f || W<0.0f) return false;
} else {
    // Condition A
    if ((U<0.0f || V<0.0f || W<0.0f) &&
            (U>0.0f || V>0.0f || W>0.0f)) return false;
}

// Some more code
const float T = U*Az + V*Bz + W*Cz;

// Some time later
if (cullBackface) {
    if (T < ray->tMin * det || T > ray->tMax * det)
        return false;
} else {
    // This is not correct (should work on the sign bit directly,
    // but this is what triggered the bug)
    const bool nearClip = fabs (T) < ray->tMin * fabs (det);
    const bool farClip = fabs (T) > ray->tMax * fabs (det);
    if (nearClip || farClip) {
        return false;
    }
}

What happened was that if cullBackface was statically known to be false, nearClip and farClip would always be true, as the call to fabs (T) was never executed. It seems that condition A led the compiler to believe that U, V or W would have some invalid value after the condition. I'm not claiming that this is the same bug that is happening in Cycles, but given that ray tracers are very sensitive to these kinds of optimizations, I wouldn't be surprised if something similar affects Cycles.

By the way, the bug-fix is to make condition A a bit more complicated.

const bool lessZero = U < 0 || V < 0 || W < 0;
const bool greaterZero = U > 0 || V > 0 || W > 0;

if (lessZero && greaterZero) {
    return false;
}

Before you ask, the ray/triangle intersection code is based on "Watertight Ray/Triangle Intersection"; and trust me, I was really puzzled when it wasn't quite as watertight as expected :)

[Update:] There was a bug in the code. Here's the correct version of the code above:

const float detSignMask = as_float (signbit (det) << 31u);

// This is the code from the repository, which is not equivalent to
// the paper. In the paper, this line would read:
// const bool nearClip = xorf (T, detSignMask) < 0;
const bool nearClip = xorf (T, detSignMask) < (ray->tMin * fabs (det));
const bool farClip = xorf (T, detSignMask) > (ray->tMax * fabs (det));

with

float xorf (float a, float b)
{
     return as_float (as_uint (a) ^ as_uint (b));
}

Associating OpenCL device ids with GPUs

What's more fun than one GPU? Two of them, of course. However, if you are using OpenCL from multiple processes, things get a bit hairy once you have multiple GPUs in a machine. A typical example would be MPI: With MPI, you'll want to spawn one process per GPU. The problem you're going to run into is how to assign GPUs, or rather, OpenCL devices, to processes.

The issue is that if you have two identical GPUs, you can't tell them apart. If you call clGetDeviceIDs, the order in which the devices are returned is unspecified, so if the first process picks the first device and the second takes the second device, they may both wind up oversubscribing the same GPU and leaving the other one idle.

What we need is a persistent, unique identifier for each device which remains stable between processes, so we can match an OpenCL device id to a physical GPU. There's no such thing in standard OpenCL, but luckily for us, there are some severely under-documented, vendor-specific extensions which can help us.

AMD

On AMD, you want to use the cl_amd_device_topology extension. This extension works on both Linux and Windows and can be used to query the PCIe location, which is unique for each GPU. Let's take a look at how this works:

// cl_ext.h is provided as part of the AMD APP SDK
#include <CL/cl_ext.h>
#include <iostream>

cl_device_topology_amd topology;
cl_int status = clGetDeviceInfo (devices[i], CL_DEVICE_TOPOLOGY_AMD,
    sizeof (cl_device_topology_amd), &topology, NULL);

if (status != CL_SUCCESS) {
    // Handle error
}

if (topology.raw.type == CL_DEVICE_TOPOLOGY_TYPE_PCIE_AMD) {
    std::cout << "INFO: Topology: " << "PCI[ B#" << (int)topology.pcie.bus
        << ", D#" << (int)topology.pcie.device << ", F#"
        << (int)topology.pcie.function << " ]" << std::endl;
}

This will give you a unique id for each GPU in your machine. You can also find this information in the AMD APP OpenCL programming guide, in the appendix.

NVIDIA

For NVIDIA, the approach is very similar. The cl_nv_device_attribute_query extension supports two undocumented tokens for clGetDeviceInfo, CL_DEVICE_PCI_BUS_ID_NV (0x4008) and CL_DEVICE_PCI_SLOT_ID_NV (0x4009), which return the same information. Testing indicates that the return value is an integer. Unfortunately, I couldn't find any documentation about this, but trust me, this works :)

Combined approach

The combined approach is to query the device vendor first, and then try to obtain the information. I combine it into an opaque 64-bit number which I associate with a device (on AMD, I merge the device and bus; on NVIDIA, the slot and bus.) I'm curious to hear how this is supposed to work for multiple Intel Xeon Phis; if you know, please drop me a line or leave a comment!
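
To make this a bit more concrete, here is a tiny sketch of the packing step in Python; vendor, bus and device_or_slot are assumed to have already been obtained via the vendor-specific queries above, and the exact bit layout is only an illustration, not necessarily the one I use:

def make_device_id(vendor, bus, device_or_slot):
    # Pack a vendor tag and the PCIe location into one opaque 64-bit number.
    # The layout is arbitrary; it only has to be stable across processes.
    vendor_tag = {'AMD': 1, 'NVIDIA': 2}[vendor]
    return (vendor_tag << 48) | ((bus & 0xFFFF) << 16) | (device_or_slot & 0xFFFF)

# Every process computes the same id for the same physical GPU,
# regardless of the order in which clGetDeviceIDs returns the devices.
print(hex(make_device_id('AMD', bus=3, device_or_slot=0)))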

Acknowledgements

Thanks to Herve & Markus for their help! Undocumented functions are sure fun ;)

Some notes on data processing

This is the long explanation behind a recent tweet:

https://twitter.com/NIV_Anteru/status/489349219120340993

The problem I'm aiming at here is the processing of large data sets. Imagine you have a 500 million polygon model you need to compute normals for. The first idea is pretty simple: let's open the model, load it, compute per-triangle normals and write them back into the same file. Voilà, no temporary disk space used, the data is mutated in place and everything is fine. Right?

Turns out, any kind of in-place mutation is actually the wrong thing to do when it comes to big data sets. To understand why, let's look at two use cases and then take a short trip into computer science 101 to get the solution.

How do you compose operations? You want to triangulate, create level-of-detail, sort your mesh and extract vertex positions. For this example, the vertex position extraction simply identifies unique vertices and only keeps them, removing the topology information. Now you want the vertex positions of the triangulated, simplified mesh and of the initial mesh. If you mutate in-place, this means you have to make a copy of your mesh, run the vertex position extraction, then take the input mesh again, run your triangulation and simplification, make a copy again, and then run the vertex extraction. All because the vertex position extraction is in-place and destructive.

How do you cache? If your data is mutable, you cannot cache outputs or inputs of one processing step. Let's assume you have a three step pipeline: Part one procedurally generates a mesh, the second part triangulates it and finally, you compute normals. Now you fix a bug in the normal computation. If you mutate in-place, your normal computation now needs a new operation mode to work on data which already has (incorrect) normals if you want to work on the "cached" triangulation output. Otherwise, you have to run again on the source data and triangulate & compute the normals from scratch. Oh, and if you fix something in the triangulation, you'll always have to run the procedural generation first, as the in-place mutated mesh cannot be fixed.

If you think a bit about these problems, you'll notice that we're actually fighting the side effects of our computation. If you know a bit about programming theory, you'll notice that these are exactly the problems which can be trivially solved by functional programming. Or, to be more precise, by modelling our operations as pure functions and treating all our data as immutable.

So what is a pure function? It's a function which, given some input, will always produce the same output. On its own, that's not that amazing, but if you add immutable data, things get really interesting.

Immutable data might sound a bit weird at first, but it's actually part of the definition. If the data were mutable, running a function on the same data twice could yield different results. What it means in practice is that a pure function will construct a new result.
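
As a toy illustration of the difference, here is a short Python sketch contrasting an in-place normal computation with a pure one; the dict-based mesh and the face_normal helper are made up for this example and are not my actual tool code:

def face_normal(face, positions):
    # Unnormalized cross product of two triangle edges.
    (ax, ay, az), (bx, by, bz), (cx, cy, cz) = (positions[i] for i in face)
    ux, uy, uz = bx - ax, by - ay, bz - az
    vx, vy, vz = cx - ax, cy - ay, cz - az
    return (uy * vz - uz * vy, uz * vx - ux * vz, ux * vy - uy * vx)

def compute_normals_in_place(mesh):
    # Destructive: the caller's mesh is changed under its feet.
    mesh['normals'] = [face_normal(f, mesh['positions']) for f in mesh['faces']]

def compute_normals(mesh):
    # Pure: the same input always yields the same output, and the input is never touched.
    result = dict(mesh)
    result['normals'] = [face_normal(f, mesh['positions']) for f in mesh['faces']]
    return result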

If you've followed so far, you'll probably see where I'm heading: disk space is cheap these days, and writing your code to mutate data in place is not an optimization -- it's a serious problem which will make your code hard to maintain, hard to optimize and hard to reuse.

Let's look at our sample problems from above and how immutable data & functional programming solve them. Composition: Trivial. We simply chain our operations, and no copying takes place. Instead, the input is read only once and handed over to the individual operators.

Caching: Again, trivial. We simply compute a hash value of each input. If we need to re-run some part, we'll only do so for the parts of the pipeline which actually changed. For example, if we fix the triangulation, we will run it for all inputs (which are cached). If some parts of the triangulation produce the same results, the normal generation won't run for them (as the input hash for that step will be the same.)
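
Here is a minimal sketch of that caching idea in Python, assuming each operator is a pure function from a byte blob to a byte blob and results are cached on disk; my actual pipeline looks different, but the principle is the same:

import hashlib
import os

def step_key(step_name, input_blob):
    # The cache key depends only on the operator and its (immutable) input.
    h = hashlib.sha1()
    h.update(step_name.encode('utf-8'))
    h.update(input_blob)
    return h.hexdigest()

def run_cached(step_name, func, input_blob, cache_dir='cache'):
    path = os.path.join(cache_dir, step_key(step_name, input_blob))
    if os.path.exists(path):
        with open(path, 'rb') as f:
            return f.read()
    output = func(input_blob)
    os.makedirs(cache_dir, exist_ok=True)
    with open(path, 'wb') as f:
        f.write(output)
    return output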

The major downside of this approach is that you'll be using a lot more disk space than before if you don't chain operations in memory. As mentioned before, I'm using a stream-oriented format for my geometry, which allows me to chain operations in memory easily, but if you do have to work on complete meshes, you'll be using quite a bit more disk space in some cases than with in-place mutation. For instance, changing vertex colours will now require you to re-write the whole file.

In practice, it turns out that the extra disk space is rarely an issue, while composition, caching and robustness matter a lot. If you can compose functions without worrying, it's usually possible to perform a lot of steps in memory and only write the output once, without intermediate files. It's also easier to provide a node network to process data if every node has a well-defined input and output; in-place mutation is basically impossible to model in such a scheme.

Second, caching is trivial to add, as are checksums for robustness. If you mutate in place, there's always a risk that something will destroy your data. That risk is much lower if your input is immutable.

Finally, there are also advantages when it comes to parallel execution. These are well-known from functional languages; for example, you can freely change your evaluation order -- run a thread for each input element, or run a thread for each operator, or both.

It's interesting that on the run-time side, I tend to go for a more functional programming style whenever I can. In my C++ code, nearly everything is const -- functions, class members, and objects. This allows me to easily reason about data flow, thread safety and to provide strong unit tests. Yet, when thinking about GiB-sized files, I fell back into the "in-place", imperative programming mind-set.

On my tool side, I have since moved nearly everything over to completely functional & immutable processing. Right now, there's only one function left which still has to be converted into a pure function, and that will hopefully be fixed soon as well. So far, the new approach has turned out to be successful. The only downside is that I have to be a bit more careful about intermediate outputs if I don't want to waste disk space, but in practice, this was far less problematic than I anticipated.

To conclude: Mutable data is bad, be it at run time or on disk. Even for large files, the short-term gains in disk I/O performance and usage are not worth it in the long run. Just say no and go for a functional approach :)

Living with Linux - 6 months recap

Time flies by -- just a moment ago I switched to Linux for development, and suddenly half a year has passed. Time to look back!

Status

I'm running Linux now at work and at home as the primary OS. That is, if I don't do anything special, I end up in Linux: all my e-mail is on Linux, my instant messaging, and so on.

At home, Windows is only left for three things: Gaming, photos, and some development. At work, I only boot Windows if I need to run some OpenCL/graphics interop. The vast majority of the time, I'm working on Linux these days.

The Linux I use is Ubuntu 14.04, or to be more specific, the Kubuntu flavour. Let's take a look at what works and what doesn't!

The good

  • Graphics drivers work better than expected. I've followed all beta, intermediate, hot-fix and other releases from AMD over the last few months, and I didn't brick my machine once. The drivers have also gotten a lot more stable; initially, I couldn't even boot into the KDE desktop without manual intervention. Today, you can leave an OpenGL application running all weekend long and even remove a monitor, and everything will still run. Notice that I was running brand-new hardware on a really recent distribution which was not officially supported. This has also improved: for Ubuntu 14.04, AMD had nearly day-one support.
  • Clang and GCC are seriously faster than Visual Studio. I had to move the build onto an SSD for Visual Studio to get speed comparable to Clang & GCC. Still, running just my unit tests is near-instantaneous on Linux, while on Windows, it takes a few seconds just to do a "null" build.
  • Even though I'm primarily working on Linux now, I don't break the Visual Studio build too often. On the other hand, working on Visual Studio only, I regularly checked in things which would break the Linux build.
  • Accessing a Windows-based network using autofs, Samba & Co. works reliably. I've copied hundreds of gigabytes in a single session from disk to network and back, without so much as a hiccup. Things just work, even if everyone else around you is using Windows.
  • Performance of most of my tools is a lot better on Linux. I'm not sure whether it's the compiler or the kernel scheduler, but recently, I've been even benchmarking an OpenGL graphics application on Linux because it was simply running faster.
  • Installing tools is so much easier it's no longer funny. Everything you need is in the package manager, and if not, compiling from source is typically trivial.
  • Connecting remotely into your machine is so much easier if you end up at a console instead of having to play the RDP game. I also tried remote PowerShell for some time, but it's a far cry from the ease of use of ssh.
  • Alt+Click for window dragging, how can you live without it?

The bad

  • Debugging with Qt Creator is nowhere near as comfortable as with Visual Studio. For some reason or another, my code crashes gdb when compiled without any optimization whatsoever; with -Og it works, but the debugging experience is not as good as it should be. The debugger is also way slower than Visual Studio's.
  • Debugging again: I miss the Visual Studio visualizers. I did write a few visualizers for Qt Creator, but they are really painful to write, don't integrate well, and debugging them is way harder than with Visual Studio 2013.
  • I can't get subpixel positioning to work in Firefox (or anywhere else, for that matter.) While I have proper anti-aliasing, every glyph is rendered exactly the same and the kerning is a bit off. On Windows, I simply force Firefox to use DirectWrite anti-aliasing for all font sizes and I'm done, but on Linux? Help! Anyone?
  • I have funny performance problems with scrolling in text editors. Sublime Text is especially guilty here. Scrolling with a few frames per second is really annoying.
  • V-sync on the desktop is broken. I never managed to get my window manager to be v-synced while applications are not. That is, what I want is tear-free dragging of windows, but I want rendering inside a window to run as fast as possible; it's okay if it only blits to the window manager every so often and I only see the synced frames. On Windows, this seems to work without issues; on Linux, I tried to get it running for a few days and gave up. At least with my R9 290X, tearing while moving windows is no longer too noticeable, but video playback is really awful sometimes. My hope is that Wayland will fix this, but it seems to be quite a few years away still.
  • LibreOffice interop is not good enough yet. On its own, it's excellent and enough for all practical purposes, but at work, I have to use Word, PowerPoint and Excel documents, and for those I typically RDP into a Windows machine and just work from there.

Overall, I'm really pleased with how the Linux experiment turned out so far. For development, it's pretty slick, except for the annoyances with debuggers. LLDB seems promising in this area -- and I already used it a few times where GDB would simply fail -- but a good graphical front-end with easy-to-write visualizers is definitely needed.

As a normal desktop, Linux does all right for me, except for the problems with font rendering and tearing. Sure, this is "just polish", but getting font rendering and v-sync right is what makes people really notice, even if they can't pinpoint it. Just look at the lengths mobile vendors go to in order to provide smooth scrolling and good (high-DPI) font rendering. I sure hope Wayland will help with the v-sync problems; regarding the fonts, everything is there already (FreeType can do sub-pixel positioning), but it'll take time until the UI toolkits properly support it.

[Image: various font-rendering problems on Linux]

That's it, and as you probably guessed, I'm going to stick with Linux for the foreseeable future :)

Managing results, or: What the hell did I do?

If you're in research, one problem you'll face in nearly every project is managing your results. Most of the time, it's just a bunch of performance numbers, sometimes it's a larger set of measurements, and sometimes you have hundreds of thousands of data points you need to analyse. In this post, I'll explain how I do it. Without further ado, let's get started!

Measuring & storing: Big data for beginners

Measuring is often considered "not a problem" and done in the simplest manner possible. Often enough, I see people taking screenshots manually or copying numbers from the console output. Those who do produce some structured output typically generate some kind of CSV and end up with hundreds of text files in their file system.

Let me say this right away: Manual measurements are no measurements. They're simply not worth the trouble. The chance that you'll make a copy-and-paste error rapidly approaches 1 as the deadline comes closer. So it has to be automated somehow, but how?

Let's take a look at the requirements first:

  • Data recording should be "crash"-safe. It's very likely that your machine will crash while recording the benchmarks, or that someone will cut your power, the disk will fail or your OS update system will chime in and kill your program. You want all results so far to be safely recorded and no partially written results.
  • Data should be stored in a structured way: Results which belong together should be stored together. Ideally, the results should be self-describing so you can reconstruct them at some point later in time without having to look at your source code.
  • Data must be easy to insert & retrieve: It should be easy to find a particular datum again. That is, if you want all tests which used a particular combination of settings, it should be possible to find them without a lot of manual filtering.
  • Test runs must be repeatable. You need to store the settings used to run the test.

Sounds like a complex problem? Yes, it is, but fortunately, there is one kind of software which solves exactly the problems outlined above: Databases. My go-to database these days is MongoDB, which is a document-oriented database. Basically, you store JavaScript objects, and MongoDB allows you to run queries over the fields of these objects.

When I say JavaScript objects, I really mean JSON. MongoDB is a JSON storage engine, using a highly efficient binary representation (BSON). The real power comes when you fully embrace JSON throughout your tools as well, in particular, if you store all settings you need to run a test as JSON and if your tools return JSON. This allows you to store the test input and output together, making it trivial to recover after a crash. If the test settings are already present in the database, you simply skip them. In my case, I typically store a document with a 'settings' field and a 'results' field.

Data insertion and retrieval is very simple. My test runners are all written in Python, and with PyMongo, you can directly take the JSON generated by your tools and insert it into the database. Processing JSON is trivial with Python, as a JSON object is also a valid Python object (it's just a nested list or dictionary in Python), and with PyMongo, you basically can view your database as one large Python dictionary.
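
To give you an idea, here is a stripped-down version of what the storage side of such a runner can look like with a current PyMongo; the database and collection names are made up:

from pymongo import MongoClient

client = MongoClient()           # defaults to localhost:27017
runs = client.benchmarks.runs    # database and collection names are arbitrary

def store_result(settings, results):
    # Skip configurations which have already been measured.
    if runs.find_one({'settings': settings}) is not None:
        return
    runs.insert_one({'settings': settings, 'results': results})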

A huge advantage of using a NoSQL database like MongoDB over a SQL database is that you don't have to think about your data layout at all. With SQL, you have to put some thought into the table structure and you'll need different tables for different kinds of results. While it's not a huge problem, it still requires set-up time which you can avoid by using a document store. Performance-wise, I'd say just don't worry. Every database these days scales to huge data sets with millions of documents and fields with ease.

There's only one case where using a separate database might be problematic, and that's if you have to store gigabytes of data at very high speed and you can't afford the interprocess communication. In this case, using an embedded database like SQLite might be necessary. If you can, you should also try to run the database on a separate machine for extra performance and robustness.

For me, the complete process looks as follows:

  • A Python runner generates a test configuration.
  • The Python runner checks if the test configuration is already present in the database. If not, execution continues.
  • Depending on the tool, the test configuration is passed as JSON or as command-line options to the benchmark tool.
  • The benchmark tool writes a JSON file with the profiling results.
  • The Python runner grabs the result file, combines it with the configuration and stores it to the database.

This process allows me to benchmark even super-unstable tools which crash on every run. As long as you spawn a new process for each individual test configuration, no invalid results can ever be generated.
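
A bare-bones version of such a runner could look like this; the tool name, the file names and generate_configurations are made up for illustration, and runs and store_result are from the sketch above:

import json
import subprocess

def run_one(settings):
    with open('config.json', 'w') as f:
        json.dump(settings, f)

    # One process per configuration: if the tool crashes, only this
    # configuration is lost and nothing half-written ends up in the database.
    if subprocess.call(['./benchmark-tool', '--config', 'config.json',
                        '--output', 'result.json']) != 0:
        return None

    with open('result.json', 'r') as f:
        return json.load(f)

for settings in generate_configurations():
    if runs.find_one({'settings': settings}) is not None:
        continue
    results = run_one(settings)
    if results is not None:
        store_result(settings, results)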

So much for data generation. Let's take a look at the second part, analysis!

Analysis

For analysis, I'm using a great, Python-based combo: The new statistics module from Python 3.4, Matplotlib and NumPy.

The statistics module provides the basic functions like median and mean computation. For anything more fancy, you'll want to use NumPy. NumPy allows you to efficiently process even large data sets, and provides you with every statistics function you'll ever need.

Finally, you'll want all plotting to be fully automated. Matplotlib works great for this -- if you want a more graphical approach, you could also try Veusz. Matplotlib can automatically generate PDF documents which can be directly included in your publication.
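
As a small example of what the automated analysis end can look like (the collection follows the sketches above; the field names and the scene name are made up):

import numpy as np
import matplotlib
matplotlib.use('Agg')            # no display needed, we only write files
import matplotlib.pyplot as plt
from pymongo import MongoClient

runs = MongoClient().benchmarks.runs

# Pull the frame times of all runs for one scene and summarize them.
times = np.array([r['results']['frameTime']
                  for r in runs.find({'settings.scene': 'bmw'})])
print('median:', np.median(times), 'mean:', np.mean(times))

# Write a figure which can be referenced directly from the publication.
plt.hist(times, bins=32)
plt.xlabel('Frame time (ms)')
plt.ylabel('Count')
plt.savefig('frame-times.pdf')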

Having everything automated at this stage is something you should really try to achieve. Chances are you'll fix some bug in your code close to the deadline, and at that point, you'll have to re-run all affected tests and re-generate the result figures. If you have everything automated, great; but if you do it manually, you're extremely likely to make a mistake at this stage. Oh, and while automating, you should also generate result tables automatically and reference them from your publication -- again, copy & paste is your worst enemy!

One note about the analysis: installing all the packages can be a bit tricky on Windows, as there is only an unofficial 64-bit NumPy installer. You can spare yourself some trouble by moving the analysis over to Linux, where you can get everything in seconds using pip. Moving the database is a matter of seconds using mongodump and mongorestore. Even better, just let the database run on a second machine and you can analyse while your benchmarks are still running. If you use SQLite, you can simply copy the files over.

That's it; I hope I could give you a good impression of how I deal with results. I've been using the approach outlined above for a few years now, to great effect. Now, every time it comes to measuring stuff, instead of "oh gosh, I hate Excel" you can hear me saying "hey, that's easy" :)