OpenCL for realtime rendering: What’s missing?

I’m a heavy user of OpenCL, relying on it exclusively for all my highly parallel computing needs. Recently, I started using OpenCL as a replacement for DirectCompute in a DirectX 11 based renderer, and while it’s close, there are still a bunch of things missing. This list is sorted roughly in order of importance. Note that all issues concern OpenCL 1.1/1.2; I do hope that future versions will resolve a bunch of them:

  1. No support for reading the depth buffer: Binding a depth buffer with 24-bit depth is not possible at all; binding a depth buffer with 32-bit depth stored as float still requires a copy between the depth buffer and a 32-bit float texture. This is just ridiculous, as the data is already on the GPU. Use cases for this are plenty: Every deferred shading implementation on the GPU wants access to the depth buffer to be able to compute the world space position. Being able to use a 32-bit depth texture would resolve 50% of the problems. The ideal case would be the ability to map (in DirectX parlance) DXGI_FORMAT_D24_UNORM_S8_UINT and DXGI_FORMAT_R32_TYPELESS textures, the former because it provides the best performance and the latter because it would allow sharing the depth buffer between OpenCL and pixel shaders.
  2. No mip-mapped texture support: OpenCL only allows binding a single mip-map level of an image. I would definitely like to bind a full mip-map chain. For instance, implementing a fast volume-raytracer is much easier if I can access a mip-mapped min/max texture for acceleration. Using global memory to emulate mip-mapped data structures results in reduced performance and super-ugly code, especially if interpolation is used. There is some hope that this will be added, as the cl_image_desc already has a field num_mip_levels. An immediate use case for me is the already mentioned volume rendering, but there are also a lot of image filtering tasks where access to all mip-map levels would be very helpful, plus some other use cases as well (for instance, updating a virtual texture memory page table.) Even worse, it can already be done today, with super-ugly code that binds each mip-map level to a separate image object.
  3. No offline kernel compiler: I have an application with lots of kernels, and the first start takes literally a minute or so while the kernels are compiled (on a dual-six-core machine; that’s longer than the application itself takes to compile/link.) This is a bad “out-of-the-box” experience; and worse, the client machine can use a different driver/compiler, which will result in errors I didn’t have on my machine. Precompiling into some intermediate format would readily solve this problem.
  4. No multi-sampled image support: Reading MSAA’ed images is a must-have for high-quality rendering; writing would be nice but is not that crucial. Again, support seems to be coming, as the cl_image_desc also has a field num_samples. The main use case I have in mind is high-quality deferred shading, where I would definitely like to use an MSAA’ed frame buffer and depth buffer.
  5. No named kernel parameter access: While it is possible to work around this using clGetKernelArgInfo, having it built in the way OpenGL does it for uniforms would be nice (oh, and defaults should be settable: I have a bunch of kernels where some parameters are there “just in case”, and being able to set them to default values would be great and is easy to do once reflection is in place.) Unfortunately, clGetKernelArgInfo is OpenCL 1.2 only, so I couldn’t try it yet.
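
To make the global-memory workaround from item 2 a bit more concrete, here is a minimal sketch in Python (picked for brevity; the real thing would be split between host and kernel code). It packs a full mip chain into one flat buffer and computes texel indices by hand. The names mip_chain_layout and texel_index are made up for illustration, and it assumes each level halves the previous one (rounding down, never below 1) with clamp-to-edge addressing.

```python
def mip_chain_layout(width, height):
    """Return ([(offset, w, h), ...], total_texels) for a flat mip buffer.

    Hypothetical helper: levels are stored back to back, largest first.
    """
    levels = []
    offset = 0
    w, h = width, height
    while True:
        levels.append((offset, w, h))
        offset += w * h
        if w == 1 and h == 1:
            break
        # Each level halves the previous one, but never drops below 1 texel.
        w = max(1, w // 2)
        h = max(1, h // 2)
    return levels, offset


def texel_index(levels, level, x, y):
    """Index of texel (x, y) at a mip level, with clamp-to-edge addressing."""
    offset, w, h = levels[level]
    x = min(max(x, 0), w - 1)
    y = min(max(y, 0), h - 1)
    return offset + y * w + x


# Usage: a 4x4 base level needs three levels (4x4, 2x2, 1x1).
levels, total = mip_chain_layout(4, 4)
index = texel_index(levels, 1, 1, 1)  # texel (1, 1) of the 2x2 level
```

Every lookup pays for this offset and stride arithmetic, and interpolating within or across levels means repeating it several times per sample, which is presumably where most of the ugliness comes from.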

OpenCL 1.2 provides at least access to texture arrays, which should help with some rendering techniques (for instance, storing lots of shadow maps in an array instead of passing 16 parameters to the kernel does simplify a lot of code.)

The thing that annoys me the most is that the vendors must already have code lying around to do this, as DirectCompute has none of these limitations. That gives me some hope that implementing it in OpenCL won’t take forever; but it’s still an annoying state now where you have some stuff in CUDA/OpenCL/DirectCompute which is not supported everywhere, even though it runs on the same hardware/driver (and I seriously hope they don’t have everything separate in the driver backend.)

That said, there’s also a bunch of performance issues that need to be resolved in the current runtimes. For instance, a kernel dispatch is still slower than a draw call: I ported some old pixel-shader style GPGPU code over to OpenCL, and while the code looks and feels the same, it’s a bit slower now on newer hardware. Plus the vendors need to get new OpenCL drivers out faster. NVIDIA in particular delayed the OpenCL 1.1 drivers by over a year. Folks, I’m still using OpenCL on NVIDIA because the hardware is good (and because there is D3D11 interop), and I’m not going to move to CUDA no matter how much you delay it. In the worst case, I’ll switch to AMD knowing that eventually NVIDIA will have to catch up.

Further down the road, there is no ecosystem for OpenCL yet, so a bunch of libraries are missing:

  1. Sort: No good sort libraries under a BSD-or-better license which have been tuned for different hardware.
  2. Linear solvers: Doing diffusion depth-of-field without a good, optimized linear solver … sucks.
  3. FFT: Ocean simulation, good bokeh: Not so good without a properly optimized FFT.

Well, back to work, that deferred OpenCL renderer doesn’t write itself :)

[Update] Why is it important to access the depth buffer directly? Because you benefit from the hardware compression during reads (reducing the required bandwidth.) This is even more important for multi-sampled buffers, as the hardware compression can really do wonders there. After copying to a normal texture, the compression is lost.
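
For reference, this is roughly what a deferred shader does with the depth value once it can read it. A minimal sketch in Python, assuming a D3D-style left-handed perspective projection with depth in [0, 1] and a symmetric frustum; the function names and parameters are hypothetical, and it stops at view space (getting to world space would need the inverse view matrix on top).

```python
import math


def linearize_depth(d, near, far):
    """Convert a [0, 1] depth-buffer value to view-space z.

    Assumes the usual D3D perspective mapping, where d = 0 is the near
    plane and d = 1 is the far plane.
    """
    return (near * far) / (far - d * (far - near))


def view_space_position(px, py, d, width, height, fov_y, near, far):
    """Reconstruct the view-space position of pixel (px, py) from depth d."""
    # Pixel center to normalized device coordinates; pixel y points down.
    ndc_x = (px + 0.5) / width * 2.0 - 1.0
    ndc_y = 1.0 - (py + 0.5) / height * 2.0
    z = linearize_depth(d, near, far)
    # Scale by the frustum extent at distance z.
    tan_half = math.tan(fov_y * 0.5)
    aspect = width / height
    return (ndc_x * z * tan_half * aspect, ndc_y * z * tan_half, z)


# Usage: the center pixel of a 1x1 target lies on the view axis.
position = view_space_position(0, 0, 0.5, 1, 1, math.radians(60.0), 1.0, 100.0)
```

The only per-pixel input is the depth value itself, so an extra copy of the whole buffer just to feed this computation is pure overhead.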

6 Responses to OpenCL for realtime rendering: What’s missing?

  1. Jacobo says:

    3. No offline kernel compiler

    You can compile your kernels, ask the API for the binaries, and the next time, upload the binaries directly, without having to recompile them.
    Also you can unload the CL compiler from the driver via the API, to free its resources.

    So now, you can fix your article.

  2. Anteru says:

    Well, the problem I mention remains: If I compile on the target machine on first start, I hope that the compiler there does not have problems with my source code. Let’s assume I use a different driver version with a different compiler. Pre-compiling resolves this issue completely. Second, I have to compile the same code for all platforms/devices: At least CPU + GPU, and sometimes for two different GPUs in the same machine. That’s just stupid, and the different compiler problem becomes worse (I had multiple kernels which failed on one of {AMD, Intel, NVIDIA} but worked on the other two.)

    Second, you can do better analysis and optimization if compiling offline. Read the AMD slides on how they hacked up LLVM to compile faster; none of this matters when compiling offline.

    So it’s not that easy, and the workarounds suck. There’s a good reason that DirectX comes with a compiler; never ever had problems loading a compiled shader; but compiling on the target machine was always tricky, in particular if different SDK versions were used.

  3. Pingback: Geeks3D Programming Links – December 06, 2011 - 3D Tech News and Pixel Hacking - Geeks3D.com

  4. Anjul says:

    I can understand why you moved from DirectCompute to OpenCL, but why is CUDA off your list?

  5. Anteru says:

    Well, two reasons: First of all, CUDA is not so much faster than OpenCL unless you use a lot of hardware specific trickery (like assuming a particular warp-size, etc.) That’s all fine if you’re doing a single publication which is supposed to run fast. If you are mostly interested in it running at all, then you also start writing more or less general CUDA which doesn’t look much different from OpenCL any more. For the stuff I’m working on now — which is mostly content creation — getting the utmost performance doesn’t matter. So from this side, I don’t see a point in using CUDA. Don’t get me wrong, I still believe there are valid use cases for CUDA, but it doesn’t have a compelling advantage over OpenCL for me.

    Second, lots of the stuff I’m doing does run quite well on CPUs. We have a volume raycaster and various image processing stuff which runs just as well on a dual-six-core as it does on a 560 for instance, and it’s pretty cool if you can offload some work without changing your code (and I totally expect a dual-8-core Sandy Bridge with 100 GiB/s memory bandwidth to be even faster at some stuff.) Some of the processing stuff may also run on machines without a DX11 GPU (we still have some DX10 hardware flying around), where it automatically switches to the CPU with OpenCL. Lower performance, of course, but it “just works”. Yes, I know there is a CUDA to x86 compiler right now, but I like that Intel is providing an optimized OpenCL runtime for Intel CPUs and AMD one for theirs, etc. instead of relying on PGI to do a good job across all hardware. Plus, the OpenCL runtime is free. I guess that if I already had a large existing body of CUDA code, I would bite the bullet and buy the CUDA->x86 compiler, but truth be told, I started serious GPU computing with OpenCL (on an AMD 5870); the first time I really used CUDA was quite a bit later.

    Do you see a striking advantage CUDA has over OpenCL? For graphics interop, they are both quite limited still … you can’t share a depth buffer in CUDA 4.1 either, there are no mip-maps, and there is no multi-sampled texture access, so except for the offline compiler CUDA 4.1 is just as good/bad for graphics as OpenCL 1.2.

  6. Dominik says:

    Hey, apart from your realtime stuff, if you want to try something out rapid-prototyping style (even in MATLAB … :) ), maybe you want to have a look at http://viennacl.sourceforge.net/. It’s quite nice, but some features are still missing (e.g. cl_image).
