Using right/left-handed viewing systems with both DirectX & OpenGL

One problem many 3D graphics programmers constantly run into is left/right-handed view matrices. In many cases, beginners get stuck with left-handed coordinate systems because they start with DirectX. Worst of all, some sources on the web claim that DirectX somehow mandates a left-handed coordinate system, which leaves beginners even more puzzled.

So let's take a look at how much truth is in this claim, by trying to derive how to use a right-handed coordinate system with DirectX. In particular, we want to be able to use exactly the same matrices as for OpenGL, i.e. a view matrix which is looking down the negative z-axis and a projection matrix which works with this view.

Before we start, keep in mind that the graphics hardware or API doesn't care at all what chirality your coordinate systems have. All they expect is that the depth values are in the correct range (for OpenGL, -1..1 and for DirectX, 0..1) and some order to determine if a face of a triangle is back-facing. That's all: as long as depth values in the correct range and order are produced, you can use anything you want.

So let's try to use a right-handed view and projection with DirectX. We have to make sure that both are indeed right-handed, i.e. that we are not mixing the "handedness" (there's a good post describing the possible issues in that case). In the simplest case, we can directly use the matrices generated by gluLookAt and gluPerspective (there's lots of source code around for how those are implemented). Using those, we now have to resolve two problems:

  • DirectX uses a 0..1 z-Range, while OpenGL uses -1..1
  • The default DirectX triangle winding is "left-handed"

Let's tackle the problems one by one. We can solve the first problem easily with a bias matrix \(B\), which translates the depth by 1 (moving -1..1 to 0..2), and a scale matrix \(S\), which scales the result by 0.5 (down to 0..1), both applied after the projection:

\[S=\begin{bmatrix}1 & 0 & 0 & 0\\ 0 & 1 & 0 & 0\\ 0 & 0 & 0.5 & 0\\ 0 & 0 & 0 & 1\end{bmatrix}\]

\[B=\begin{bmatrix}1 & 0 & 0 & 0\\ 0 & 1 & 0 & 0\\ 0 & 0 & 1 & 1\\ 0 & 0 & 0 & 1\end{bmatrix}\]

All we need to do now is to apply the projection matrix \(P\) first, and then \(S\times B\), i.e. the total projection matrix is \(S\times B\times P\). We can now use an OpenGL-style projection matrix \(P\) with DirectX, as the depth range is correctly mapped. However, if we use back-face culling, we will notice that we cull exactly the opposite faces, which brings us to our second problem.
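To see why this works, it may help to multiply the two matrices out once. This is only a worked check using the \(S\) and \(B\) from above; \(z_c, w_c\) are names I'm introducing here for the clip-space depth and \(w\) produced by \(P\):

\[S\times B=\begin{bmatrix}1 & 0 & 0 & 0\\ 0 & 1 & 0 & 0\\ 0 & 0 & 0.5 & 0.5\\ 0 & 0 & 0 & 1\end{bmatrix}\]

\[z_c' = 0.5\,z_c + 0.5\,w_c,\qquad w_c' = w_c,\qquad \frac{z_c'}{w_c'} = 0.5\,\frac{z_c}{w_c} + 0.5\]

So a post-divide depth of -1 ends up at 0 and a depth of 1 stays at 1, which is the 0..1 range DirectX expects.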

The graphics APIs define the front-face by the vertex order. For DirectX, if you have a triangle with three vertices a,b,c, then the triangle is facing towards you if the normal (computed by the cross product of the edges \(b-a,c-a\)) points towards you. However, DirectX assumes a left-handed coordinate system by default, so you must use the "left-hand" rule for the normal. This is of course the opposite of the right-handed view we're using, but it can be trivially fixed: when creating a rasterizer state, set FrontCounterClockwise to true, and everything behaves consistently.
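The flip in winding can be checked with a few lines of arithmetic. The sketch below is not DirectX code; it only computes the z component of the cross product of the edges \(b-a,c-a\) for a triangle in a y-up 2D screen space, and shows that mirroring one axis (which is what a change of handedness boils down to on screen) reverses the vertex order. All names in it are made up for illustration.

```cpp
#include <cassert>

struct Vec2 { double x, y; };

// z component of the cross product (b - a) x (c - a), i.e. twice the
// signed area of the triangle. In a y-up screen space, a positive value
// means the vertices are ordered counter-clockwise.
double SignedArea2 (Vec2 a, Vec2 b, Vec2 c)
{
    return (b.x - a.x) * (c.y - a.y) - (b.y - a.y) * (c.x - a.x);
}

bool IsCounterClockwise (Vec2 a, Vec2 b, Vec2 c)
{
    return SignedArea2 (a, b, c) > 0.0;
}

// Mirrors a point along the x-axis; a single mirror flips the handedness.
Vec2 MirrorX (Vec2 v)
{
    return { -v.x, v.y };
}
```

A triangle that is counter-clockwise before mirroring is clockwise afterwards, which is why the culling mode (or FrontCounterClockwise) has to change together with the handedness.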

There's one problem with the scale/bias matrix approach though, which is numerical precision. Projections already have notorious precision problems, and if we work on the coordinates after the projection, precision is going to get even worse. However, we can factor the output depth range directly into the projection matrix:

\[\begin{bmatrix} s_x & 0 & 0 & 0\\ 0 & s_y & 0 & 0\\ 0 & 0 & \frac{d_f \cdot z_f - d_n \cdot z_n}{z_n - z_f} & \frac{(d_f - d_n)(z_n \cdot z_f)}{z_n - z_f}\\ 0 & 0 & -1 & 0 \end{bmatrix}\]

Here, \(s_x, s_y\) are the aspect ratio dependent scale factors (\(s_y = \cot(\text{fov}/2)\) for a full vertical field of view \(\text{fov}\), and \(s_x = s_y / r_a\) with the aspect ratio \(r_a\)), \(z_n, z_f\) are the distances to the near and far plane, and \(d_n, d_f\) are the depth values of the near and far plane after the projection (for DirectX, use \(d_n = 0, d_f = 1\); for OpenGL, use \(d_n = -1, d_f = 1\)). This is exactly the same matrix I use for both DirectX and OpenGL without any modifications whatsoever.
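For reference, here is a minimal sketch of how the matrix above could be built and checked in code. The function names, the row-major layout and the use of double are my assumptions for this example, and it assumes the field of view is the full vertical angle, so the scale is \(\cot(\text{fov}/2)\) as in gluPerspective.

```cpp
#include <array>
#include <cassert>
#include <cmath>

// Row-major 4x4 matrix, applied to column vectors (m * v) as in the text.
using Mat4 = std::array<std::array<double, 4>, 4>;

// Hypothetical helper: right-handed perspective projection looking down the
// negative z-axis. fovY is the full vertical field of view in radians,
// aspect is width / height, zNear/zFar are positive plane distances and
// dNear/dFar the depth values the planes should map to.
Mat4 MakePerspective (double fovY, double aspect,
                      double zNear, double zFar,
                      double dNear, double dFar)
{
    assert (zNear > 0.0 && zFar > zNear);

    const double sy = 1.0 / std::tan (fovY / 2.0);
    const double sx = sy / aspect;

    Mat4 m = {};   // value-initialized, all entries are zero
    m[0][0] = sx;
    m[1][1] = sy;
    m[2][2] = (dFar * zFar - dNear * zNear) / (zNear - zFar);
    m[2][3] = (dFar - dNear) * (zNear * zFar) / (zNear - zFar);
    m[3][2] = -1.0;
    return m;
}

// Depth after the perspective divide for a point at view-space depth zView
// (negative in front of the camera).
double ProjectedDepth (const Mat4& m, double zView)
{
    const double zClip = m[2][2] * zView + m[2][3];
    const double wClip = m[3][2] * zView + m[3][3];
    return zClip / wClip;
}
```

Calling MakePerspective with a depth range of 0..1 gives the DirectX variant and with -1..1 the OpenGL variant; in both cases ProjectedDepth should return dNear for a point on the near plane (zView = -zNear) and dFar for a point on the far plane.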

That's all, there isn't any more magic involved!

[Update]: Fixed the combined depth-range projection matrix, thanks Marc!

OpenCL for realtime rendering: What's missing?

I'm a heavy user of OpenCL, relying on it exclusively for all my highly parallel computing needs. Recently, I started using OpenCL as a replacement for DirectCompute in a DirectX 11 based renderer, and while it's close, there are still a bunch of things missing. This list is sorted roughly in order of importance. Note that all issues concern OpenCL 1.1/1.2; I do hope that future versions will resolve a bunch of them:

  1. No support for reading the depth buffer: Binding a depth buffer with 24-bit depth is not possible at all; binding a depth buffer with 32-bit depth stored as float still requires a copy between the depth buffer and a 32-bit float texture. This is just ridiculous, as the data is already on the GPU. Use cases for this are plenty: Every deferred shading implementation on the GPU wants access to the depth buffer to be able to compute the world space position. Being able to use a 32-bit depth texture would resolve 50% of the problems. The ideal case would be the ability to map (in DirectX parlance) DXGI_FORMAT_D24_UNORM_S8_UINT and DXGI_FORMAT_R32_TYPELESS textures, the former because it provides the best performance and the latter because it would allow sharing the depth buffer between OpenCL and pixel shaders.

  2. No mip-mapped texture support: OpenCL only allows binding a single mip-map level of an image. I would definitely like to bind a full mip-map chain; for instance, implementing a fast volume raytracer is much easier if I can access a mip-mapped min/max texture for acceleration. Using global memory to emulate mip-mapped data structures results in reduced performance and super-ugly code, especially if interpolation is used. There is some hope that this will be added, as cl_image_desc already has a field num_mip_levels. An immediate use case for me is the already mentioned volume rendering, but there are also a lot of image filtering tasks where access to all mip-map levels would be very helpful, plus some other use cases as well (for instance, updating a virtual texture memory page table). Even worse, it can already be done today, with super-ugly code that binds each mip-map level to a separate image object.

  3. No offline kernel compiler: I have an application with lots of kernels, and the first start takes literally a minute or so while the kernels are compiled (on a dual-six-core machine; that's longer than the application itself takes to compile/link). This is a bad "out-of-the-box" experience; worse, the client machine can use a different driver/compiler, which can result in errors I didn't have on my machine. Precompiling into some intermediate format would readily solve this problem.

  4. No multi-sampled image support: Reading MSAA'ed images is a must-have for high-quality rendering; writing would be nice but is not that crucial. Again, support seems to be coming, as cl_image_desc also has a field num_samples. The main use case I have in mind is high-quality deferred shading, where I would definitely like to use an MSAA'ed frame and depth buffer.

  5. No named kernel parameter access: While it is possible to work around this using clGetKernelArgInfo, having it built in the way OpenGL does it for uniforms would be nice. (Oh, and defaults should be settable: I have a bunch of kernels where some parameters are there "just in case"; being able to set them to default values would be great and is easy to do once reflection is in place.) Unfortunately, clGetKernelArgInfo is OpenCL 1.2 only, so I couldn't try it yet.
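To give an idea of what the global-memory emulation from item 2 involves, here is a rough host-side sketch of the bookkeeping. It assumes a square, power-of-two texture with all mip levels stored back to back in one linear buffer; the layout and every name in it are hypothetical, and a kernel would need the same index arithmetic (plus manual filtering) on its side.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical layout: level 0 first, then each smaller level appended.
struct MipChain
{
    std::vector<std::size_t> offsets;  // start of each level, in texels
    std::vector<std::size_t> sizes;    // edge length of each level, in texels
    std::size_t totalTexels = 0;
};

MipChain BuildMipChain (std::size_t baseSize)
{
    // Only square power-of-two textures are handled in this sketch
    assert (baseSize > 0 && (baseSize & (baseSize - 1)) == 0);

    MipChain chain;
    for (std::size_t size = baseSize; size > 0; size /= 2)
    {
        chain.offsets.push_back (chain.totalTexels);
        chain.sizes.push_back (size);
        chain.totalTexels += size * size;
    }
    return chain;
}

// Index of texel (x, y) on the given level inside the linear buffer.
std::size_t TexelIndex (const MipChain& chain, std::size_t level,
                        std::size_t x, std::size_t y)
{
    assert (level < chain.sizes.size ());
    assert (x < chain.sizes[level] && y < chain.sizes[level]);
    return chain.offsets[level] + y * chain.sizes[level] + x;
}
```

Even this simple case needs an offset table that has to be passed to every kernel, and it does not handle interpolation between texels or levels at all, which is where the code gets ugly.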

OpenCL 1.2 provides at least access to texture arrays, which should help with some rendering techniques (for instance, storing lots of shadow maps in an array instead of passing 16 parameters to the kernel does simplify a lot of code.)

The thing that annoys me the most is that the vendors must already have code lying around to do this, as DirectCompute has none of these limitations. That gives me some hope that implementing it in OpenCL won't take forever; but it's still an annoying state now where you have some stuff in CUDA/OpenCL/DirectCompute which is not supported everywhere, even though it runs on the same hardware/driver (and I seriously hope they don't have everything separate in the driver backend).

That said, there are also a bunch of performance issues that need to be resolved in the current runtimes. For instance, a kernel dispatch is still slower than a draw call: I ported some old pixel-shader style GPGPU code over to OpenCL, and while the code looks and feels the same, it's a bit slower now on newer hardware. Plus, the vendors need to get new OpenCL drivers out faster. NVIDIA in particular delayed the OpenCL 1.1 drivers for over a year. Folks, I'm still using OpenCL on NVIDIA because the hardware is good (and because there is D3D11 interop), and I'm not going to move to CUDA no matter how much you delay it. In the worst case, I'll switch to AMD knowing that eventually NVIDIA will have to catch up.

Further down the road, there is no eco-system for OpenCL yet, so a bunch of libraries are missing:

  1. Sort: There are no good sort libraries under a BSD license or better which have been tuned on different hardware.

  2. Linear solvers: Doing diffusion depth-of-field without a good, optimized linear solver ... sucks.

  3. FFT: Ocean simulation, good bokeh: Not so good without a properly optimized FFT.

Well, back to work, that deferred OpenCL renderer doesn't write itself :)

[Update] Why is it important to access the depth buffer directly? Because you benefit from the hardware compression during reads (reducing the required bandwidth). This is even more important for multi-sampled buffers, as the hardware compression can really do wonders there. After copying to a normal texture, the compression is lost.

C++, ownership and shared_ptr

I'm a big fan of C++ shared_ptr -- or I should rather say, I used to be a big fan. Lately I ran into some issues where the "shared ownership" model promoted by liberal use of shared_ptr started to make the code more complex and error prone. Let's start with the actual code in question and take a look at a possible solution.

A bit of context before we start: The code sample will consist of two classes, Draw2D and Draw2DText. The classes themselves don't matter too much, but the relationship is important. On creation, Draw2D allocates some hardware context in order to draw elements. This hardware context should be released when the Draw2D instance is destroyed to free resources. All text instances created by a particular Draw2D instance are bound to that hardware context. That means that once the Draw2D instance is destroyed, all elements created by it are practically useless as the only operation that can be performed on them safely is destruction.

An additional complication is not crucial to the discussion, but is necessary to understand the first design. The Draw2D instance must be able to iterate over all items it has created when the window size changes to re-layout them. I know that this can be solved with container classes and callbacks, but for the sake of simplicity there is only Draw2D and the Draw2DText, so Draw2D must keep some kind of link to its children. We will have to take care of this for the first design, but it will follow "naturally" for the later designs.

Without much further ado, the original implementation:

class Draw2D
{
public:
    std::shared_ptr<Draw2DText> CreateText (...)
    {
        auto text = new Draw2DText;
        // The custom deleter routes destruction through Draw2D::Release
        auto ptr = std::shared_ptr<Draw2DText> (text,
            std::bind (&Draw2D::Release, this, std::placeholders::_1));

        // Add ptr to elements_
        return ptr;
    }

    void Draw (const Draw2DText& text)
    {
        // Use the Draw2D state to draw the text
    }

    void Update ()
    {
        // Loop over all elements, call element->Update()
        // If element is expired, add to free list
        // Remove all items from the free list
    }

    void Release (Draw2DText* e)
    {
        // Find e in elements_
        // Remove the corresponding pointer if found
    }

    std::vector<std::weak_ptr<Draw2DText>> elements_;
};

This approach does work, but has a few hidden drawbacks:

  • Probably the smallest problem, but by default, a lot of shared_ptr copying happens here, which is not really "free" due to lots of atomic instructions. The real problem is deeper: We're acting as if the shared_ptr has ownership of the object, while it hasn't.
  • The Draw2D instance must perform garbage collection on its end, even though it is responsible for its created items. There is an ownership relation between Draw2D and its children, which is nowhere to be found in the code.
  • We're creating a function object per shared_ptr which doesn't do much interesting stuff.
  • Iterating over the items requires us to lock before use and keep track of the items ready for garbage collection (yes, we can remove them while iterating, but it's still something we have to handle.)
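To make the last point concrete, this is roughly what the Update loop has to look like with a vector of weak_ptr. It is a simplified sketch rather than the original code: Draw2DText is reduced to a stand-in that only counts updates, and UpdateAndCollect is a name made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Stand-in for the real Draw2DText; it only counts how often it was updated.
struct Draw2DText
{
    int updates = 0;
    void Update () { ++updates; }
};

// Updates every element that is still alive and erases expired entries
// while iterating. Returns the number of entries that were collected.
std::size_t UpdateAndCollect (std::vector<std::weak_ptr<Draw2DText>>& elements)
{
    std::size_t collected = 0;
    for (auto it = elements.begin (); it != elements.end ();)
    {
        // lock () returns an owning shared_ptr, or null if the object is gone
        if (auto element = it->lock ())
        {
            element->Update ();
            ++it;
        }
        else
        {
            // erase () returns the iterator following the removed entry
            it = elements.erase (it);
            ++collected;
        }
    }
    return collected;
}
```

Every single access pays for a lock (), and the owner has to clean up after objects it supposedly owns.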

Overall, the problem stems from the fact that while there are many good uses for shared_ptr, expressing ownership is not one of them. In this case, the Draw2D instance "owns" its Draw2DText instances, as those become invalid when the Draw2D instance is destroyed. Moreover, the client, who has ownership of the Draw2D instance, clearly knows when to let go of the Draw2DText instances, as the current design already does not allow using them after the Draw2D instance is destroyed.

OK, so what can we do? The first change we can make is to destroy the Draw2DText instances from the Draw2D destructor. This is possible as the shared_ptr goes through the Release indirection, which means we can safely remove the objects without notifying the shared_ptr. With this change in place, the user definitely cannot use the text elements any more, as they are now pointing to invalid memory, but the user couldn't use them before either. In addition, we can easily check if the user tries to use a stale object by checking first if the object is registered with this particular Draw2D instance.

Let's look closer now at what we have: All child instance lifetimes are now directly bound to the owner, so we actually don't have to free them manually unless we choose to do so. This is good, but during the lifetime of the Draw2D instance we now force the user to do manual memory management.

Can we get the same thing more easily? Yes, we can, by ditching all shared_ptr usage in Draw2D/Draw2DText and only providing it as an option. Let's see: if we use plain pointers now, we are still able to delete all instances when the Draw2D instance is destroyed. Nothing lost here. Giving out plain pointers to the user indicates that there is an ownership question that the user should look up in the documentation. Fortunately, if the user does not read it, we're still not leaking memory, as we are going to release the memory at the end. This leaves us with a single problem: The user cannot use the comfort provided by the old system. This can be easily resolved though by introducing a custom function which produces exactly the same shared_ptr as before, i.e. with a custom deleter calling Release (). The new implementation:

class Draw2D
{
public:
    ~Draw2D ()
    {
        // Loop over all elements, delete element
    }

    Draw2DText* CreateText (...)
    {
        auto text = new Draw2DText;
        // No further magic required here
        // Add text to elements_
        return text;
    }

    void Draw (const Draw2DText& text)
    {
        // Use the Draw2D state to draw the text
    }

    void Update ()
    {
        // Loop over all elements, call element->Update()
        // No free list management here
    }

    void Release (Draw2DText* e)
    {
        // Find e in elements_
        // Remove the corresponding pointer if found
    }

    std::vector<Draw2DText*> elements_;
};

Assuming that Draw2D is marked as non-copyable, it's just as safe to use as before. The ownership is more visible than before while the user can opt-in to a slow, but simple reference counting via shared_ptr.
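The opt-in function mentioned above is not part of the listing, so here is one way it might look, together with a filled-in version of the relevant parts of Draw2D. This is a sketch under my own assumptions, not the original code: MakeShared and GetElementCount are hypothetical names, Draw2DText is an empty stand-in, and CreateText takes no parameters here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

struct Draw2DText {};   // stand-in for the real text class

class Draw2D
{
public:
    Draw2D () = default;
    Draw2D (const Draw2D&) = delete;              // non-copyable
    Draw2D& operator= (const Draw2D&) = delete;

    ~Draw2D ()
    {
        // Whatever the user did not release is destroyed with the owner
        for (auto element : elements_)
        {
            delete element;
        }
    }

    Draw2DText* CreateText ()
    {
        auto text = std::unique_ptr<Draw2DText> (new Draw2DText);
        elements_.push_back (text.get ());
        return text.release ();   // elements_ owns the object from here on
    }

    // Destroys e early if it is registered with this instance; does nothing
    // otherwise.
    void Release (Draw2DText* e)
    {
        auto it = std::find (elements_.begin (), elements_.end (), e);
        if (it != elements_.end ())
        {
            delete *it;
            elements_.erase (it);
        }
    }

    std::size_t GetElementCount () const { return elements_.size (); }

private:
    std::vector<Draw2DText*> elements_;
};

// Hypothetical opt-in helper: wraps a text in a shared_ptr whose deleter
// hands the object back to its owner instead of calling delete directly.
// The Draw2D instance must outlive the returned shared_ptr.
std::shared_ptr<Draw2DText> MakeShared (Draw2D& owner, Draw2DText* text)
{
    return std::shared_ptr<Draw2DText> (text,
        [&owner] (Draw2DText* e) { owner.Release (e); });
}
```

A user who wants reference counting writes auto text = MakeShared (draw, draw.CreateText ()); and gets the old behaviour, while everybody else works with plain pointers and either calls Release () or lets the Draw2D destructor clean up. The lookup in Release () is also the place where a stale-object check would naturally go.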

As we have seen, shared_ptrs don't always lead to simpler or easier to understand code, and in some cases, they actually obfuscate the relationship and lifetime between objects. It's also interesting to see that this case couldn't have been handled easily in managed languages: Either you would wind up with circular references (Draw2D to Draw2DText and back), which only get released once the Draw2D instance and all its children are unreachable, or you have an explicit method like "Dispose" on the Draw2D instance which releases the underlying hardware context and simply renders everything unusable.

I'm curious if there is a better way to design this code to completely resolve this issue. Of course, keeping the hardware context alive would be the easiest, but we explicitly want to allow the user to manage the lifetime of this manually. Given the constraints, I have the strong feeling that this approach is the way to go.

Having implemented this, I'm also starting to get the impression that explicit lifetime management is often easier than expected. In many cases, there is a clear relationship between objects. As long as the release method is explicitly tied to the parent object, it is always clear when the lifetime of an object ends. While I'm still using shared_ptr in many places, I'm more aware of the potential problems they bring.

Ava source code is now public

A dear reader asked whether the Ava source code is public, and yes, it is now! Ava is a small video-processing tool I've been using for producing videos for publications (I have already blogged everything interesting about it). It took some time, but I finally got around to adding a bunch of comments and slapping the BSD license on it. You can grab it fresh from the Bitbucket repository.

The tool itself is really really small, so please don't expect Nuke/Premiere here :)