Programming tools

By popular demand, here's the list of tools and stuff I use regularly now:

  • Visual Studio 2012 with a custom, Solarized Light based color theme. I really like this a lot, and while it takes a while to get used to it, I find it hard to go back to high-contrast themes. Project files are all generated with CMake of course.
  • My programming font is Source Code Pro. It's super-easy to read, works fine at the sizes I like, and I can have a consistent look & feel on Windows, Linux and Mac OS X.
  • For serious scripting, I use Python 3 together with PyQt. In my applications however, I typically embed Lua if I want to quickly try out stuff.
  • For UI, I'm using Qt if it requires more than 5 widgets. If it's super simple (read: a single-button window), I might hack it together in C# and WPF, but I usually go for Qt or PyQt.
  • My text editor of choice is Sublime Text 2 (and yes, I bought a license). I'm using the DirectWrite font rendering there and the CMake extension.
  • For references, I use Zotero Standalone. The storage is set to a SkyDrive drive, so it automatically synchronizes across all my machines for free.
  • For source code management, I use Mercurial exclusively: on Windows and Linux in the form of TortoiseHg, and on Mac OS X, SourceTree.
  • For taking notes and quick sketches, I use a graphics tablet and MyPaint.
  • Most graphics I need are done using Inkscape. Otherwise, it's Veusz, but I'm curious to try matplotlib.
  • For IM, I've recently switched from Pidgin to Instantbird. Instantbird is basically the same, but I like the default theme a bit more, and it takes up only a tiny amount of screen space.
  • For 3D graphics, I took the time to learn a bit more Blender. The learning curve is steep, but it's definitely worth the trouble. The hardest part for me was to unlearn Maya, which I used previously, but with some time and practice Blender is just as good for the little 3D modelling I need. For occasionally viewing a mesh, I usually use my own tools now, or, in rare cases, MeshLab.
  • For screenshots, I'm completely sold on Screenshot Captor. It can capture Aero windows with transparency and shadows, and just works.
  • For video recording, I use Microsoft Expression Encoder 4 and Fraps (yes, with license). For batch processing of images I use ImageMagick, and for the video production it's Ava together with FFMPEG.
  • Documentation is now all done with Sphinx and Robin.

Just recently, I also started to give mind mapping software a try, in my case Mindjet MindManager, as our university has a deal with them and we can get it for free. It seems pretty decent, but I'm not sure how much I'll be using it in the end. It certainly has that "oh gosh, this boosts my productivity by 12053%" feeling at first, but it might turn out to be too cumbersome to use in the long run.

Sublime Text is probably the most significant change that came this year. I actually forced myself to use a different editor than before, just to learn something new, and so far, I'm really happy about that. The other big change was learning Blender a bit better. I've delayed this a lot, but this year, I sat down on a few weekends to seriously model something with it. On the programming side, I've done a bit of JavaScript and Lua this year as part of my "do something new" efforts. I expect to spend some more time with both languages next year, so getting accustomed to them doesn't hurt.

That's it for the tool side this year. What I'm still looking for is a better e-mail client, but the rest of the tools get the work done. Well, that, and maybe a rewrite of Robin, to make it a bit faster and more robust.

Getting started with OpenCL, Part #3

The final part of the short introduction to OpenCL. In this part, we'll be using images and implement a simple blur filter. You should start with the base application written in the first part of this guide as we won't need anything from the SAXPY example.

We'll be loading an image and filtering it using a small Gaussian blur filter. Instead of hard-coding the filter, we'll provide it as an input argument to the kernel, which allows you to easily change it at run-time. The kernel will then sample all pixels inside the filter radius, multiply them with the filter weight, and finally write the new value to the output image.
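Since the weights are passed in at run-time, the host can generate them with plain C++. As a sketch (the function name and the choice of a sigma parameter are mine, not from the original code), normalized Gaussian weights for a filter of a given radius could be computed like this:

```cpp
#include <cmath>
#include <vector>

// Build a (2*radius+1)^2 Gaussian filter, stored row by row and normalized
// so all weights sum to 1; the kernel multiplies each sampled pixel with
// one of these weights.
std::vector<float> CreateGaussianFilter (const int radius, const float sigma)
{
    std::vector<float> weights;
    float sum = 0.0f;
    for (int y = -radius; y <= radius; ++y) {
        for (int x = -radius; x <= radius; ++x) {
            const float w = std::exp (-(x * x + y * y) / (2.0f * sigma * sigma));
            weights.push_back (w);
            sum += w;
        }
    }
    for (float& w : weights) {
        w /= sum; // normalize so the image brightness is preserved
    }
    return weights;
}
```

The resulting buffer has exactly the layout the kernel below expects for its filterWeights argument.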

OpenCL is well suited for these problems as it has direct support for images. Everything you would expect works on images: you can read and write any pixel directly, the data can be converted automatically, and images can be sampled using a bilinear filter. Creating an image is much the same as creating a buffer, but it requires an additional parameter describing the format of the image:

static const cl_image_format format = { CL_RGBA, CL_UNORM_INT8 };
cl_mem inputImage = clCreateImage2D (context,
    CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, &format,
    image.width, image.height, 0,
    const_cast<char*> (image.pixel.data ()), &error);
CheckError (error);

This is the OpenCL 1.1 API; in OpenCL 1.2, you would use clCreateImage, which takes an image descriptor, so a single function is enough instead of five different creation functions. The important part here is the format, in particular the second field: CL_UNORM_INT8. This indicates that the data is provided as 8-bit integers, stored as unsigned-normalized, or unorm for short. Unsigned-normalized means that 0 maps to 0.0 and 255 maps to 1.0 when the data is read. It also means you read the data as floating point using read_imagef. This is perfect for our use case, as we have to weight each sample using a floating-point value.
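The unorm mapping is easy to emulate on the host; here's a sketch (the helper names are mine) of what conceptually happens when a CL_UNORM_INT8 channel is read and written:

```cpp
#include <cstdint>

// Reading a CL_UNORM_INT8 channel: the byte is scaled into [0, 1].
float UnormToFloat (const std::uint8_t value)
{
    return value / 255.0f;
}

// Writing: the float is clamped to [0, 1] and scaled back, rounding
// to the nearest representable byte.
std::uint8_t FloatToUnorm (float value)
{
    if (value < 0.0f) value = 0.0f;
    if (value > 1.0f) value = 1.0f;
    return static_cast<std::uint8_t> (value * 255.0f + 0.5f);
}
```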

The second thing necessary for reading an image is a sampler. It describes how coordinates are interpreted and whether the image data should be filtered. In our case, we want to index pixels using integers, so we set CLK_NORMALIZED_COORDS_FALSE; we want out-of-bounds accesses to be clamped (CLK_ADDRESS_CLAMP_TO_EDGE); and we don't want any filtering. Setting the filter to CLK_FILTER_NEAREST means that the sampler returns the value of the pixel nearest to the requested position, without any interpolation. With this, we can now assemble the complete kernel:

__constant sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE
    | CLK_ADDRESS_CLAMP_TO_EDGE
    | CLK_FILTER_NEAREST;

float FilterValue (__constant const float* filterWeights,
    const int x, const int y)
{
    return filterWeights[(x+FILTER_SIZE) + (y+FILTER_SIZE)*(FILTER_SIZE*2 + 1)];
}

__kernel void Filter (
    __read_only image2d_t input,
    __constant float* filterWeights,
    __write_only image2d_t output)
{
    const int2 pos = {get_global_id(0), get_global_id(1)};

    float4 sum = (float4)(0.0f);
    for(int y = -FILTER_SIZE; y <= FILTER_SIZE; y++) {
        for(int x = -FILTER_SIZE; x <= FILTER_SIZE; x++) {
            sum += FilterValue(filterWeights, x, y)
                * read_imagef(input, sampler, pos + (int2)(x,y));
        }
    }

    write_imagef (output, (int2)(pos.x, pos.y), sum);
}

You should notice three things: first, we never define FILTER_SIZE in the code; second, we use get_global_id in two dimensions; and third, we have a parameter specified as __constant. The missing FILTER_SIZE is easily explained: instead of passing the filter size into the kernel as a parameter, we pass it using a #define. This means we have to recompile the program if we change the filter size, but it also allows the compiler to easily unroll the innermost loop. To pass a definition to the compiler, simply add "-D FILTER_SIZE=1" to the options in the clBuildProgram call.
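To see what the FilterValue indexing does for FILTER_SIZE=1: the nine weights of a 3x3 filter are stored row by row, and the offsets x, y in [-FILTER_SIZE, FILTER_SIZE] are shifted into [0, 2*FILTER_SIZE]. A host-side sketch of the same index computation:

```cpp
// Host-side equivalent of the kernel's FilterValue indexing for
// FILTER_SIZE = 1: (x, y) is shifted to non-negative coordinates and
// flattened row-major into the 9-entry weight array.
const int FILTER_SIZE = 1;

int FilterIndex (const int x, const int y)
{
    return (x + FILTER_SIZE) + (y + FILTER_SIZE) * (FILTER_SIZE * 2 + 1);
}
```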

We use a 2D domain in this example as it naturally maps to the 2D image; there's no need to write a 1D to 2D mapping on our own. Finally, let's see what that __constant means. As we learned in the second part, OpenCL has multiple address spaces. In the second example, we only used __global; __constant is another address space for read-only data. From the host side, it looks exactly the same as global memory, but some devices, GPUs in particular, have special support for constant data which may result in better performance. That's also why it can be fairly small: OpenCL only guarantees that it is at least 64 KiB in size.
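To get a feeling for that limit: each weight is a 4-byte float, and a filter of radius r needs (2r+1)^2 of them, so the guaranteed 64 KiB hold 16384 floats. A small sketch (the helper name is mine) of the largest radius that still fits:

```cpp
// Largest filter radius whose (2r+1)^2 float weights still fit into the
// guaranteed 64 KiB of __constant memory.
int MaxFilterRadius ()
{
    const int maxFloats = (64 * 1024) / 4; // 16384 floats
    int r = 0;
    while ((2 * (r + 1) + 1) * (2 * (r + 1) + 1) <= maxFloats) {
        ++r;
    }
    return r;
}
```

So even the minimum guarantee leaves plenty of room for any practical blur filter.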

The rest is business as usual: We create the images, run the kernel, and copy back the data. The example assumes that the image is stored as PPM; you can convert any image to and from PPM using GIMP. One minor difficulty arises from the fact that PPM is RGB only, while OpenCL requires the image to be stored as RGBA, so we have to convert it. You should also use a reasonably large test image; on a 4096 by 3264 pixel image, I can barely see all CPU cores working for a brief moment.
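The RGB-to-RGBA conversion is trivial to do on the host; a sketch (the function name is mine, and alpha is simply set to fully opaque) of expanding 3-byte pixels to 4 bytes:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// PPM stores 3 bytes per pixel (RGB); OpenCL wants 4 (RGBA), so we
// expand each pixel and set alpha to fully opaque (255).
std::vector<std::uint8_t> RGBtoRGBA (const std::vector<std::uint8_t>& rgb)
{
    std::vector<std::uint8_t> rgba;
    rgba.reserve ((rgb.size () / 3) * 4);
    for (std::size_t i = 0; i < rgb.size (); i += 3) {
        rgba.push_back (rgb [i]);     // R
        rgba.push_back (rgb [i + 1]); // G
        rgba.push_back (rgb [i + 2]); // B
        rgba.push_back (255);         // A
    }
    return rgba;
}
```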

That's all, I hope this quick introduction gives you a basic understanding of OpenCL. If you have any questions, feel free to comment, and depending on interest, I might do another series about advanced OpenCL.

Getting started with OpenCL, Part #2

If you have followed the guide so far, you should have a skeleton application ready which does not yet run any OpenCL code, but prepares everything that is necessary. So let's get a quick understanding of how code is executed using OpenCL, and write our first OpenCL application. Make sure to grab the code before continuing to read.

Code that gets executed on a device is called a kernel in OpenCL. Kernels are written in a C dialect, which is mostly straightforward C with a lot of built-in functions and additional data types. For instance, 4-component vectors are a built-in type just as integers are. For the first example, we'll be implementing a simple SAXPY kernel. A quick word of warning before we continue: This is an example! Running SAXPY using OpenCL is most likely much slower than running it directly, not because of OpenCL, but because SAXPY is purely limited by memory bandwidth and we'll be copying the data one additional time in this example. It may make sense if you are using data already on the device, but not if you have to copy a buffer from the CPU to the device and back.
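SAXPY stands for "single-precision a*x plus y". As a plain serial reference (the function name is mine), the computation we're about to port looks like this, with one loop iteration per future work item:

```cpp
#include <cstddef>
#include <vector>

// Serial SAXPY: y = a*x + y. This is exactly what the OpenCL kernel
// will compute, with each iteration becoming one work item.
void Saxpy (const float a, const std::vector<float>& x, std::vector<float>& y)
{
    for (std::size_t i = 0; i < x.size (); ++i) {
        y [i] += a * x [i];
    }
}
```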

That said, SAXPY is a good starting point, as it is a simple and easy-to-understand kernel. Before we start, we have to take a quick look at how OpenCL sees the hardware. So far, we have been working on the host side only. In our case, the host application is written in C++ and runs on the CPU of the machine. The important part is that the host calls the OpenCL C API functions to set up a queue and manage the data. The actual processing is done on the device. This may also be the CPU, but from the point of view of the host it doesn't matter. In fact, the code we're writing here will work just the same no matter whether the device is a CPU or a GPU.

The downside is that we must move data from the host to the device and back. This is necessary as the device can use a different memory space. GPUs for instance have on-board memory which is completely separate from the host memory. Before you ask: Yes, there are ways to optimize the copies when you are using a CPU device which is the same as the host, but for the sake of simplicity we will be copying data here.

What has to be done is thus: allocate device memory, copy the data from the host to the device, set up and run a kernel, and copy the results back. Allocating memory is easily done using the buffer API:

cl_mem aBuffer = clCreateBuffer (context,
    CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
    sizeof (float) * (testDataSize),
    a.data (), &error);
CheckError (error);

You can also specify that the memory should be copied directly while creating the buffer and that the copy should be blocking, that is, the command will only return once the operation has finished. All we need now is the kernel itself.

As mentioned before, a kernel is a piece of C code. Specifically, it's a single function marked with the __kernel modifier. In our example, the kernel looks like this:

__kernel void SAXPY (__global float* x, __global float* y, float a)
{
    const int i = get_global_id (0);

    y [i] += a * x [i];
}
The __global modifier indicates that this is global memory, or simply memory allocated on the device. OpenCL supports different address spaces; in this sample, we'll be using only global memory. You're probably wondering where the SAXPY loop went; this is one of the key design decisions behind OpenCL, so let's first understand how the code is executed.

Unlike C, OpenCL is designed for parallel applications. The basic assumption is that many instances of the kernel are executed in parallel, each processing a single work item. Multiple work items are executed together as part of a work group. Inside a work group, each kernel instance can communicate with the other instances. This is the only execution-ordering guarantee OpenCL gives you: there is no specified order in which work items inside a group are processed, and the work-group execution order is undefined as well. In fact, you cannot tell whether the items are executed in parallel, sequentially, or in random order. This freedom and the minimal amount of data exchange and dependencies between work items is what makes OpenCL fast. Work-groups do allow for some ordering, as all items in a work-group can be synchronized. This comes in handy if you want to, for instance, load a part of an image into cache, process it, and write it back from cache. For the example at hand, however, we will ignore work-groups completely.
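To make the id scheme concrete: in OpenCL 1.1 the global work size must be a multiple of the work-group size, and the global id decomposes into a group id and a local id. A sketch (the function names are mine) of that arithmetic:

```cpp
// With a global size of N work items and a work-group size of G (where
// N is a multiple of G, as OpenCL 1.1 requires), the runtime launches
// N / G work-groups.
int WorkGroupCount (const int globalSize, const int groupSize)
{
    return globalSize / groupSize;
}

// Each work item is identified by its group and local id:
// get_global_id(0) == get_group_id(0) * get_local_size(0) + get_local_id(0)
int GlobalId (const int groupId, const int localId, const int groupSize)
{
    return groupId * groupSize + localId;
}
```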

In order to identify the kernel instance, the runtime environment provides an id. Inside the kernel, we use get_global_id which returns the id of the current work item in the first dimension. We will start as many instances as there are elements in our vector, and each work item will process exactly one entry.

Multiple kernels can be present in a single source file, which OpenCL calls a program. We start by creating a program from the OpenCL source code, and then create the kernel from it. That sounds more complicated than it actually is:

CheckError (clBuildProgram (program, deviceIdCount, deviceIds.data (),
    nullptr, nullptr, nullptr));

cl_kernel kernel = clCreateKernel (program, "SAXPY", &error);
CheckError (error);

The kernel is now ready to be used; all that is left is to bind the arguments to it. This is done using clSetKernelArg. With OpenCL 1.1, it's not possible to find out which argument has which name, so we have to set them by number:

clSetKernelArg (kernel, 0, sizeof (cl_mem), &aBuffer);
clSetKernelArg (kernel, 1, sizeof (cl_mem), &bBuffer);
static const float two = 2.0f;
clSetKernelArg (kernel, 2, sizeof (float), &two);

As mentioned above, we have to specify the size of the vector before we can run the kernel. That is, we have to tell OpenCL what dimension our work domain has and the extents in each dimension. This is done using clEnqueueNDRangeKernel:

const size_t globalWorkSize [] = { testDataSize, 0, 0 };
CheckError (clEnqueueNDRangeKernel (queue, kernel,
    1, // One dimension
    nullptr, globalWorkSize, nullptr,
    0, nullptr, nullptr));

This enqueues the kernel for execution. Finally, we need to get the results back, which is done using clEnqueueReadBuffer. We call it with the blocking parameter set to true, so the call will block until the kernel has finished working before the data is read back. We also didn't specify the work group size, so the implementation will use something suitable.

And voilà, our first OpenCL program has run! In the next and final part, we'll take a look at how to do image processing with OpenCL, to make it do something actually useful.

Getting started with OpenCL, Part #1

Welcome to a short series on how to get started with OpenCL. I assume that you are a developer, you know what OpenCL is and you want to get up to speed quickly. We'll be building a small example application with OpenCL which will eventually be able to apply a blur filter on an image. You can find the complete source code at Bitbucket. In this part, we'll prepare everything so we can actually use OpenCL. You should fetch the corresponding code to follow along easily. The second part covers how to run a simple kernel, and the third part does a slightly more complicated example where an image is processed.

First of all, a quick overview of how OpenCL actually works. OpenCL comes as a runtime environment and has to be installed on your target machine, no matter if you are using Windows or Linux. For Mac OS X, OpenCL is already part of the system, so there is nothing to install there. The runtime installs two things: First, a dispatch library and then the actual runtime containing the implementation. The dispatch library is necessary as there is typically more than one runtime present on a machine. For instance, there might be the AMD CPU runtime and an NVIDIA GPU runtime installed at the same time. The dispatcher makes sure that they don't overwrite each other.

To get started, you need to be able to link against this dispatch library, called OpenCL.dll on Windows. Every OpenCL SDK comes with an OpenCL.lib which allows exactly that. The most important OpenCL SDKs right now are:

  • The Intel SDK, which works on newer Intel CPUs and integrated graphics units from Intel. It comes with OpenCL 1.2 support.
  • The AMD APP SDK, which works on any CPU (both AMD and Intel) and on AMD GPUs. It supports OpenCL 1.2 as well.
  • The NVIDIA GPU SDK, which works on NVIDIA GPUs only and supports only OpenCL 1.1.

No matter which one you choose, you have to make sure that OpenCL.lib is found by your linker and that the headers are in your include path. For the example project, I'll be using CMake, so we need a FindOpenCL.cmake which does the search. Feel free to grab the one from the repository. If you are not familiar with CMake, then take a look at the CMake tutorial first.

Assuming that OpenCL was found correctly, you can now include the OpenCL header in your source code. The exact path depends on whether you are using Mac OS X or not:

#ifdef __APPLE__
    #include "OpenCL/opencl.h"
#else
    #include "CL/cl.h"
#endif

That's all; OpenCL is completely contained in a single header. The API is written in C, and while there is a C++ wrapper, we'll be using the C API only for this example. Don't worry, it's not too hard or overly verbose. Before we get started with actual code, grab a copy of the specification. It contains a reference of all functions, error codes, and structures, making it very handy. In particular, the error codes for each function are well explained and make it easy to understand why a particular call fails.

Remember the dispatch library mentioned above which allows choosing between different implementations? This is exactly how we'll start. An implementation is called a platform in OpenCL; each platform can contain multiple devices. A device is where the code actually gets executed in the end; notice that each device in the same platform can support different features. For instance, if you have the AMD runtime installed and an AMD CPU and GPU, both will appear under the AMD platform but the GPU might support a different OpenCL version and extensions than the CPU. For this example, we'll use only core OpenCL 1.1 features, which are available virtually everywhere, but keep this in mind for the future.

We begin by querying what platforms are available:

cl_uint platformIdCount = 0;
clGetPlatformIDs (0, nullptr, &platformIdCount);

std::vector<cl_platform_id> platformIds (platformIdCount);
clGetPlatformIDs (platformIdCount, platformIds.data (), nullptr);

All OpenCL APIs follow the same scheme: we call the function first with an empty output buffer to obtain the number of platforms. After that, we can fetch the platforms into a correctly sized buffer.

We don't care in particular about which platform we're going to use, so we continue directly by querying the devices for the first platform we've found. Notice how strikingly similar the code is:

cl_uint deviceIdCount = 0;
clGetDeviceIDs (platformIds [0], CL_DEVICE_TYPE_ALL, 0, nullptr,
    &deviceIdCount);

std::vector<cl_device_id> deviceIds (deviceIdCount);
clGetDeviceIDs (platformIds [0], CL_DEVICE_TYPE_ALL, deviceIdCount,
    deviceIds.data (), nullptr);

Now that we have found a device, can we finally run some computation on it? Not so fast: to actually use it, two more things are needed. A context, which manages resources on a set of devices, and a command queue, which executes the commands. They are separated so you can create all resources up-front using only the context, and then create multiple queues on the same device to submit work from multiple threads.

Creating a context is straightforward:

const cl_context_properties contextProperties [] =
{
    CL_CONTEXT_PLATFORM,
    reinterpret_cast<cl_context_properties> (platformIds [0]),
    0, 0
};

cl_context context = clCreateContext (
    contextProperties, deviceIdCount, deviceIds.data (), nullptr,
    nullptr, &error);

From here on, we'll also check for errors. Every function that can fail in OpenCL returns an error code, either via an output parameter (if the function creates an object) or directly as the return value. We check that this error value is set to CL_SUCCESS, which indicates that the call worked, and exit the application otherwise.
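The CheckError helper used throughout this series isn't shown in the text; the version in the repository may differ, but a minimal sketch would simply treat anything but CL_SUCCESS (which is 0 in the OpenCL headers) as fatal:

```cpp
#include <cstdlib>
#include <iostream>

// Minimal error handler sketch: CL_SUCCESS is defined as 0 in the
// OpenCL headers; everything else is an error code, so we print it
// and bail out.
void CheckError (const int error)
{
    if (error != 0 /* CL_SUCCESS */) {
        std::cerr << "OpenCL call failed with error " << error << std::endl;
        std::exit (1);
    }
}
```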

Remember that resources like the context have to be cleaned up later on using the appropriate release method. Every resource is reference counted; after creation, it starts with one reference. The reference count can be increased using the clRetain* methods and decreased using clRelease*. There's no need to release the platform or the devices, but everything else must be cleaned up.

At this point, we have a context ready, now we need a queue as well, which is equally easy to create. If you followed along so far, you have done everything necessary to set up OpenCL and now you're ready to get some work done; we'll take a look at how to actually run code in the second part of this introduction.

OpenCL and DirectX/OpenGL interop

I'm a big fan of OpenCL, even though it still has a few graphics-related problems. It remains a viable target for graphics-related computations, with the promise of a single implementation of various compute kernels that can be used with both graphics APIs (DirectX and OpenGL).

OpenCL has had built-in interop support for OpenGL since day one with the cl_khr_gl_sharing extension. The situation on the DirectX side was not that great, though. At the beginning, only NVIDIA provided interoperability support: cl_nv_d3d9_sharing, cl_nv_d3d10_sharing, and cl_nv_d3d11_sharing. The extensions are all highly similar and provide direct, low-level access to DirectX resources. AMD followed quickly for DirectX 10 with cl_khr_d3d10_sharing, which is a 1:1 copy of the NVIDIA extension. Unfortunately, there was no equivalent for DirectX 11. While in theory it is possible to use the D3D10 extension with DirectX 11 through a lot of voodoo and DirectX 10/11 interop, a KHR version of the NVIDIA extension was clearly the way to go.

This took quite some time, until OpenCL 1.2, which now comes with a standard cl_khr_d3d11_sharing extension. The OpenCL version is a slightly expanded version of the NVIDIA extension (for instance, it allows disabling the synchronization guarantees). NVIDIA does not support it, though (as it doesn't support OpenCL 1.2), but AMD just recently implemented it in their driver. I didn't notice at first, as there was no public announcement, and if you search the web for cl_khr_d3d11_sharing and AMD, you usually wind up at a post of mine where I shake my fist at AMD for not supporting it. However, thanks to Christophe Riccio, I just discovered that since at least Catalyst 12.10 the driver exposes the DirectX 11 KHR sharing extension. That still means two code paths for NVIDIA and AMD, but that's a small price to pay for much wider hardware support. In particular, current AMD hardware has advantages in terms of memory bandwidth (both to local as well as to global memory), which could be highly beneficial for a bunch of use cases I have.

If you're interested in OpenCL and DirectX11, feel free to try on any hardware now! Just keep in mind that this is not yet a fully debugged and stable system. At least on NVIDIA (haven't tried AMD's DirectX11/OpenCL support yet) we're running into driver synchronization bugs from time to time as well as weird kernel compile errors. Still, you can get some nice stuff working, and I'm looking forward to trying it all out on an AMD card now as well!