OpenCL for realtime rendering: What's missing?
I’m a heavy user of OpenCL, relying on it exclusively for all my highly parallel computing needs. Recently, I started using OpenCL as a replacement for DirectCompute in a DirectX11 based renderer, and while it’s close, there are still a bunch of things missing. This list is sorted roughly in order of importance. Notice that all issues concern OpenCL 1.1/1.2; I do hope that future versions will resolve many of them:
- No support for reading the depth buffer: Binding a depth buffer with 24-bit depth is not possible at all; binding a depth buffer with 32-bit float depth still requires a copy from the depth buffer into a 32-bit float texture. This is just ridiculous, as the data is already on the GPU. There are plenty of use cases: every deferred shading implementation on the GPU wants access to the depth buffer to be able to compute the world-space position. Being able to use a 32-bit depth texture directly would already resolve 50% of the problems. The ideal case would be the ability to map (in DirectX parlance) DXGI_FORMAT_D24S8 and DXGI_FORMAT_R32_TYPELESS textures, the former because it provides the best performance and the latter because it would allow sharing the depth buffer between OpenCL and pixel shaders. (A sketch of the copy workaround follows after this list.)
- No mip-mapped texture support: OpenCL only allows binding a single mip-map level of an image. I would definitely like to bind a full mip-map chain; for instance, implementing a fast volume raytracer is much easier if I can access a mip-mapped min/max texture for acceleration. Emulating mip-mapped data structures through global memory results in reduced performance and super-ugly code, especially if interpolation is used. There is some hope that this will be added, as cl_image_desc already has a num_mip_levels field. An immediate use case for me is the already mentioned volume rendering, but there are also a lot of image filtering tasks where access to all mip-map levels would be very helpful, plus some other use cases as well (for instance, updating a virtual texture memory page table.) Even worse, it can already be done today, with super-ugly code that binds each mip-map level to a separate image object (sketched after this list.)
- No offline kernel compiler: I have an application with lots of kernels, and the first start takes literally a minute or so while the kernels are compiled (on a dual six-core machine; that’s longer than the application itself takes to compile and link.) This is a bad out-of-the-box experience, and worse, the client machine may use a different driver/compiler, which can produce errors I never saw on my machine. Precompiling into some intermediate format would readily solve this problem. (Until then, caching program binaries helps; see the sketch after this list.)
- No multi-sampled image support: Reading MSAA’ed images is a must-have for high-quality rendering; writing would be nice, but is not that crucial. Again, support seems to be coming, as cl_image_desc also has a num_samples field. The main use case I have in mind is high-quality deferred shading, where I would definitely like to use an MSAA’ed frame- and depth buffer.
- No named kernel parameter access: While it is possible to work around this using clGetKernelArgInfo (sketched after this list), having it built in the way OpenGL does it for uniforms would be nice. (Oh, and defaults should be settable; I have a bunch of kernels where some parameters are there “just in case”, and being able to set them to default values would be great and is easy to do once reflection is in place.) Unfortunately, clGetKernelArgInfo is OpenCL 1.2 only, so I couldn’t try it yet.
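As promised, here is what the depth buffer workaround from the first item looks like in practice. This is a minimal sketch, not production code: it assumes a depth buffer created from the DXGI_FORMAT_R32_TYPELESS family, an R32_FLOAT texture of the same size as the copy target, and the cl_khr_d3d11_sharing extension (whose entry points have to be fetched through clGetExtensionFunctionAddress, omitted here.) All variable names are made up.

    // depthCopy was created with DXGI_FORMAT_R32_FLOAT; CopyResource works
    // because both formats belong to the same R32_TYPELESS family.
    d3dContext->CopyResource (depthCopy, depthBuffer); // the extra copy

    cl_int err = CL_SUCCESS;
    cl_mem clDepth = clCreateFromD3D11Texture2DKHR (
        clContext, CL_MEM_READ_ONLY, depthCopy, 0 /* subresource */, &err);

    // D3D11 objects must be acquired before and released after kernel use.
    clEnqueueAcquireD3D11ObjectsKHR (clQueue, 1, &clDepth, 0, NULL, NULL);
    // ... enqueue kernels that read clDepth as a read_only image2d_t ...
    clEnqueueReleaseD3D11ObjectsKHR (clQueue, 1, &clDepth, 0, NULL, NULL);

With direct depth buffer access, the CopyResource call, and the bandwidth it burns (see the update at the end), would simply disappear.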
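The mip-map workaround from the second item looks roughly like this; a sketch assuming four levels, with all names invented. Since OpenCL 1.x has neither mip-mapped images nor arrays of images, every level becomes a separate argument:

    // One image2d_t per mip level; level selection has to be done by hand.
    float4 read_mip (image2d_t mip0, image2d_t mip1,
                     image2d_t mip2, image2d_t mip3,
                     sampler_t smp, float2 uv, int level)
    {
        switch (level) {
        case 0:  return read_imagef (mip0, smp, uv);
        case 1:  return read_imagef (mip1, smp, uv);
        case 2:  return read_imagef (mip2, smp, uv);
        default: return read_imagef (mip3, smp, uv);
        }
    }

Trilinear filtering makes this even uglier, as it needs two of these reads plus a manual mix() between the two levels.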
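For the compile-time problem, the standard mitigation is to cache program binaries, which at least fixes every start after the first one. A minimal sketch for a single-device context, with error handling and the disk I/O left out (the cache should be keyed by device name and driver version):

    // After the first successful clBuildProgram: fetch the binary ...
    size_t binarySize = 0;
    clGetProgramInfo (program, CL_PROGRAM_BINARY_SIZES,
        sizeof (binarySize), &binarySize, NULL);

    std::vector<unsigned char> binary (binarySize);
    unsigned char* binaryPtr = binary.data ();
    clGetProgramInfo (program, CL_PROGRAM_BINARIES,
        sizeof (binaryPtr), &binaryPtr, NULL);
    // ... and write it to disk.

    // On subsequent runs, skip the source compile entirely:
    const unsigned char* data = binary.data ();
    cl_program cached = clCreateProgramWithBinary (clContext,
        1, &device, &binarySize, &data, NULL, NULL);
    clBuildProgram (cached, 1, &device, NULL, NULL, NULL);

This is no substitute for a real offline compiler, though: the binary format is implementation-defined, so the cache has to be thrown away whenever the driver changes, and the very first start on a client machine stays slow.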
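And for completeness, this is what the clGetKernelArgInfo workaround from the last item should look like, as far as I can tell from the 1.2 spec (untested, for the reason given above.) Note that the program has to be built with the -cl-kernel-arg-info option, otherwise the argument names are not retained:

    // Build a name -> index map once per kernel ...
    cl_uint numArgs = 0;
    clGetKernelInfo (kernel, CL_KERNEL_NUM_ARGS,
        sizeof (numArgs), &numArgs, NULL);

    std::map<std::string, cl_uint> argIndex;
    for (cl_uint i = 0; i < numArgs; ++i) {
        char name [256] = { 0 };
        clGetKernelArgInfo (kernel, i, CL_KERNEL_ARG_NAME,
            sizeof (name), name, NULL);
        argIndex [name] = i;
    }

    // ... then set arguments by name ('shadowMapCount' is made up.)
    cl_int shadowMapCount = 16;
    clSetKernelArg (kernel, argIndex ["shadowMapCount"],
        sizeof (shadowMapCount), &shadowMapCount);

Once this map exists, default values are a single clSetKernelArg per parameter right after kernel creation.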
OpenCL 1.2 at least provides access to texture arrays, which should help with some rendering techniques; for instance, storing lots of shadow maps in an array instead of passing 16 separate parameters to the kernel does simplify a lot of code (see the sketch below.)
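For illustration, the shadow map case with an image array could look like this; a minimal OpenCL 1.2 sketch with invented names, where the slice index travels in the third coordinate component:

    __kernel void sample_shadows (__read_only image2d_array_t shadowMaps,
                                  sampler_t smp,
                                  __global const float4* coords, // xy = uv, z = map index
                                  __global float* result)
    {
        const int i = get_global_id (0);
        // read_imagef on a 2D image array takes (x, y, slice) in coord.xyz.
        result [i] = read_imagef (shadowMaps, smp, coords [i]).x;
    }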
The thing that annoys me the most is that the vendors must already have the code for all of this lying around, as DirectCompute has none of these limitations. That gives me some hope that implementing it in OpenCL won’t take forever; but it’s still an annoying state of affairs when features exist in some of CUDA/OpenCL/DirectCompute but are not supported everywhere, even though everything runs on the same hardware and driver (and I seriously hope the vendors don’t maintain completely separate driver backends.)
That said, there’s also a bunch of performance issues that need to be resolved in the current runtimes. For instance, a kernel dispatch is still slower than a draw call: I ported some old pixel-shader-style GPGPU code over to OpenCL, and while the code looks and feels the same, it’s now a bit slower on newer hardware. Plus, the vendors need to get new OpenCL drivers out faster; NVIDIA in particular delayed its OpenCL 1.1 drivers by over a year. Folks, I’m still using OpenCL on NVIDIA because the hardware is good (and because there is D3D11 interop), and I’m not going to move to CUDA no matter how much you delay it. In the worst case, I’ll switch to AMD, knowing that eventually NVIDIA will have to catch up.
Further down the road, there is no ecosystem for OpenCL yet, so a bunch of libraries are missing:
- Sort: No good sort libraries (BSD licensed or better) that have been tuned across different hardware.
- Linear solvers: Doing diffusion depth-of-field without a good, optimized linear solver ... sucks.
- FFT: Ocean simulation, good bokeh: not so good without a properly optimized FFT.
Well, back to work, that deferred OpenCL renderer doesn’t write itself :)
Update: Why is it important to access the depth buffer directly? Because you benefit from the hardware compression during reads (reducing the required bandwidth.) This is even more important for multi-sampled buffers, where the hardware compression can really do wonders. After copying to a normal texture, the compression is lost.