## Debugging D3D12 fences & queues

Welcome to a hands-on session with DirectX 12. Christian recently made me aware of a synchronization problem in my D3D12 sample which required multiple tries to fix (thanks again for reporting this!). The more interesting part, however, is how to find such a problem with tools instead of a very close code review like the one Christian did.

## The setup

If you want to follow along, make sure to check out the repository at revision 131a28cf0af5. I don’t want to give away too much in one go, so we’ll assume right now there is some synchronization issue and we’ll debug it step-by-step. Let’s start with taking a look using the Visual Studio Graphics Diagnostics. For this, you need to install the Graphics Tools in Windows 10 — Visual Studio should prompt you to get them when you start the graphics debugging.

Without further ado, let’s start the GPU usage analysis. You can find it under “Debug”, “Start diagnostic tools without debugging”, “GPU Usage”. After the application ends, you should see something like this:

Let’s select a second or so and use the “view details” button on this. The view you’ll get should be roughly similar to the output below.

That’s a lot of things going on. To find our application, just click on one of the entries in the table below, and you should find which blocks belong to our application. In my case, I get something like this:

Ok, so what do we see here? Well, the CPU starts after the GPU finishes, with some delay. Also, the GPU 3D queue is very empty, which is not surprising as my GPU is not really taxed with rendering a single triangle. Due to the fact that we’re running VSync’ed, we’d expect to be waiting for the last queued frame to finish before the CPU can queue another frame.

Let’s try to look at the very first frame:

Looks like the CPU side is only tracked after the first submission, but what is suspicious is that the GPU frame time looks like a single frame was rendered before the CPU was invoked again. We’d expect the CPU side to queue up three frames though, so the first frame time should be actually three times as long. Can we get a better understanding of what’s happening?

## GPUView

Yes, we can, but we’ll need another tool for this – GPUView. GPUView is a front-end for ETW, the built-in Windows event tracing, and it hasn’t gotten much love. To get it, you need to install the “Windows Performance Toolkit”. Also, if you use a non-US locale, you need to prepare a user account with en_US as the locale or it won’t work. Let’s assume you have everything ready; here’s the one-minute guide to using it:

1. Fire up an administrator command prompt
2. Go to C:\Program Files (x86)\Windows Kits\10\Windows Performance Toolkit\gpuview
4. Type in log m, then Alt+Tab to your application
5. Let it run a second or two, Alt+Tab back, and type log
6. Run GPUView on the Merged.etl file.

Just like in the Visual Studio graphics analysis tool, you’ll need to select a few milliseconds worth of time before you can make any use of the output. I zoomed in on three frames here.

Notice the color coding for each application is random, so here my sample got dark purple. We can see it executing on the 3D queue, and at the bottom, we see the CPU submission queue.

You’ll notice that suspiciously, just while the GPU is busy, the CPU queue is completely empty. That doesn’t seem right – we should have several frames queued up, and the moment the GPU starts working (this is right after the VSync, after all!), we should be queuing up another frame.

Let’s take a look at the present function. Conceptually, it does:

1. Call present
2. Advance to the next buffer
3. Signal a fence for the current buffer

At the next frame start, we’ll wait for the buffer associated with the current queue slot, which happens to be the slot we just used! This means we’re waiting for the last frame to finish before we issue a new one, draining the CPU queue, and that’s what we see in the GPUView output. Problem found! Fortunately, it’s a simple one, as the only thing we need to change is to wait for the right fence. Let’s fix this (and also the initial fence values, while we’re at it) and check again with GPUView.
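The ordering problem can be reproduced without any D3D12 calls. What follows is a minimal model under stated assumptions, not the sample’s actual code: `FrameQueue`, `PresentBuggy`, `PresentFixed` and `FramesInFlight` are hypothetical names, fence values are plain counters, and three queue slots are assumed.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical model: each queue slot remembers the fence value that was
// signaled for the last frame submitted with it (0 = never used).
constexpr int kQueueSlotCount = 3;

struct FrameQueue {
    std::array<std::uint64_t, kQueueSlotCount> slotFence{};
    std::uint64_t nextFenceValue = 1;
    int currentSlot = 0;
};

// Fence value the CPU waits for before it records into the current slot.
std::uint64_t FenceToWaitFor(const FrameQueue& q) {
    return q.slotFence[q.currentSlot];
}

// Buggy ordering: advance first, then signal the (new) current slot.
void PresentBuggy(FrameQueue& q) {
    q.currentSlot = (q.currentSlot + 1) % kQueueSlotCount;
    q.slotFence[q.currentSlot] = q.nextFenceValue++;
}

// Fixed ordering: signal the slot that was just rendered, then advance.
void PresentFixed(FrameQueue& q) {
    q.slotFence[q.currentSlot] = q.nextFenceValue++;
    q.currentSlot = (q.currentSlot + 1) % kQueueSlotCount;
}

// Frames that may still be pending on the GPU when the CPU starts recording
// the next frame: frames submitted minus the fence value waited for.
std::uint64_t FramesInFlight(bool fixed, int frames) {
    assert(frames >= kQueueSlotCount);
    FrameQueue q;
    for (int i = 0; i < frames; ++i) {
        if (fixed) {
            PresentFixed(q);
        } else {
            PresentBuggy(q);
        }
    }
    const std::uint64_t submitted = q.nextFenceValue - 1;
    return submitted - FenceToWaitFor(q);
}
```

In this model, the buggy ordering always waits for the frame that was just submitted, so `FramesInFlight(false, 10)` is 0, while the fixed ordering leaves two frames queued and `FramesInFlight(true, 10)` is 2.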

Looks better, we see a present packet queued and some data after it. Let’s zoom really close on what happens during the rendering.

What do we have here? Two present packets queued up, while the GPU is processing the frame. Here we can also see how long it takes to queue up and submit the data to the GPU. Notice that the total time span we’re looking at here is in the order of 0.5 ms!

So finally, we fixed the problem and verified the GPU is no longer going idle but instead, the CPU queue is always nicely filled. While in this example we’re limited by VSync, in general you want to keep the GPU 100% busy, which requires you to always have one more frame’s worth of work queued up. Otherwise, the GPU will wait for the CPU and vice versa, and even a wait of 1 ms on a modern GPU is something in the order of 10 billion FLOPs wasted (in my example, on an AMD Fury X, we’re talking about 8601600000 FLOPs per ms!). That’s a lot of compute power you really want to throw at your frame.
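For reference, the per-millisecond figure can be reproduced from the card’s public specifications. This is a small sketch assuming 4096 stream processors at 1050 MHz and counting a fused multiply-add as two FLOPs; `PeakFlopsPerMs` is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstdint>

// Peak FP32 throughput per millisecond.
// lanes * 2 FLOPs per clock (FMA) * clockMHz * 1e6 clocks/s / 1000 ms/s
constexpr std::uint64_t PeakFlopsPerMs(std::uint64_t shaderLanes,
                                       std::uint64_t clockMHz) {
    return shaderLanes * 2 * clockMHz * 1000;
}

// AMD Radeon R9 Fury X: 4096 stream processors at 1050 MHz (assumed specs).
static_assert(PeakFlopsPerMs(4096, 1050) == 8'601'600'000ULL,
              "matches the per-millisecond figure quoted above");
```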

## 5 years of data processing: Lessons learned

During my thesis work, I had to process lots of data. Many meshes I worked on contained hundreds of millions of triangles, and the intermediate and generated outputs would typically range in the tens to hundreds of GiB. All of this means that I had to spend a significant amount of time on “infrastructure” code to ensure that data processing remained fast, reliable and robust.

My work also required me to create many different tools for specific tasks. Over the years, this led to additional challenges in the area of tool creation and tool discovery. In this blog post, I’ll take a look at the evolution of my tool infrastructure over roughly five years, and the lessons I learned.

## Processing overview

Data processing includes tasks like converting large data sets into more useful formats, cleaning up or extracting data, generating new data and finally modifying data. We can group the tools into two broad categories. The first category reads an input and generates an output, while the second mutates the existing data in some way. A typical example for category one is a mesh converter; for category two, think for instance of a tool which computes smooth vertex normals.

Why is it important to make that distinction? Well, in the early days, I did have two kinds of tools. Those in the second category would typically read and write the same file, while those in the first category had well defined inputs and outputs. The idea was that tools which fall into the second category would wind up being more efficient by working in-place. For instance, a tool which computes a level-of-detail simplification of a voxel mesh simply added the level-of-detail data into the original file (as the tool which consumed the data would eventually expect everything to be in a single file.)

## Mutating data

Having tools which mutate files leads to all sorts of problems. The main problem I ran into was the inability to chain tools, and the fact that I would often have to regenerate files to undo the mutation. Case in point, the level-of-detail routine would sometimes create wrong blocks, and those can’t be easily fixed by re-running the tool with a special “replace” flag. Instead, I had to wipe all level-of-detail data first, followed by re-running the tool. And that was after I had fixed all bugs which would damage or replace the source data.

## Towards functional tools

Over the years, I refactored and rewrote all tools to be side-effect free. That is, they have read-only input data and write one or more outputs. Turns out, this made one optimization mandatory for all file-formats I used: The ability for seek-free, or at least seek-minimal reading. As I mentioned before, the original reason for mutating data in-place was performance. By writing into the same file, I could avoid having to copy over data which was taking a long time for the large data sets I had to work with.

The way I solved this was to ensure that all file formats could be read and written with near-perfect streaming access patterns. Rewriting a file would then be just as fast as copying, and also made processing faster in many cases, to the point that “in-place” mutation was no longer worth it. The biggest offender was the level-of-detail creation, which previously wrote into the same file. Now, it wrote the level-of-detail data into a separate file, and if I wanted to have them all together again, I had to merge them which was only practical once the read/write speed was close to peak disk I/O rates.

In the end, the changes to the file formats to make them “stream-aware” turned out to be quite small. Some things, like the geometry streams, were streams to start with. For the voxel storage, which was basically a filesystem-in-a-file, all functions were modified to return entries in disk-offset order. For many clients, this change was totally transparent and immediately improved throughput close to theoretical limits.
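To illustrate the disk-offset ordering, here is a minimal sketch. The `Entry` layout and both function names are assumptions made for illustration; the actual container format is not shown in this post.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical directory entry of a filesystem-in-a-file container.
struct Entry {
    std::string name;
    std::uint64_t offset;  // byte offset of the payload inside the container
    std::uint64_t size;    // payload size in bytes
};

// Returns the entries sorted by offset, so that a reader which visits them
// in order only ever moves forward through the file.
std::vector<Entry> InDiskOrder(std::vector<Entry> entries) {
    std::sort(entries.begin(), entries.end(),
              [](const Entry& a, const Entry& b) { return a.offset < b.offset; });
    return entries;
}

// Total number of bytes the reader has to seek backwards when visiting the
// entries in the given order; 0 means a pure forward streaming pattern.
std::uint64_t BackwardSeekBytes(const std::vector<Entry>& entries) {
    std::uint64_t position = 0;
    std::uint64_t backward = 0;
    for (const Entry& e : entries) {
        if (e.offset < position) {
            backward += position - e.offset;
        }
        position = e.offset + e.size;
    }
    return backward;
}
```

A client that simply iterates over whatever the container returns gets the forward-only access pattern for free, which is presumably why the change was transparent for many callers.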

## Tool creation & discovery

After several years, a big problem I ran into was tool discovery. I had dozens of command-line tools, with several commands each and with lots of command-line options. Figuring out which ones I had and how to use them became an increasingly complicated memory-game. It also increased the time until other users would become productive with the framework as tools were scattered around in the code base. I tried to document them in my framework documentation, but that documentation would rarely match the actual tool. The key issue was that the documentation was in a separate file.

Similarly, creating a new tool meant creating a new project, adding a new command, parsing the command-line and calling a bunch of functions. Those functions were in the tool binary and could not be easily reused. Moving them over to libraries wasn’t an option either, as these functions were typical library consumers and very high-level. And finally, even if I had them all as functions in a library, I would still need a way to find them.

The solution was to implement a new way for tool creation which also solved the tool discovery problem. This turned out to be an exercise in infrastructure work. The key problem was to balance the amount of overhead such that the creation of a tool doesn’t become too complicated, but I still get the benefits from the infrastructure.

What I ended up with was leveraging a lot of my framework’s “high-level” object classes, run-time reflection and least-overhead coding. Let’s look at the ingredients one-by-one. In my framework, there’s a notion of an IObject quite similar to Java or C#, with boxing/unboxing of primitive types. If I could somehow manage to restructure all tool inputs & outputs to fit into the object class hierarchy, this would allow me to use all of the reflection I already had in place. It turns out that because the tools are called infrequently, and because inputs are typically files, strings, numbers or arrays, moving this into a class-based, reflection-friendly approach wasn’t too hard.

Now I just had to solve the problem of how to make a tool easy to discover. For each tool, I need to store some documentation alongside it. Storing the tool description and documentation separately turned out to be a failure. The solution I ended up with was to embed the declarative part as SJSON right into the source file.

Let’s take a look at a full source file for a tool which calls a vertex-cache index optimizer for a chunk:

    #include "OptimizeIndices.h"

    #include "niven.Geometry.VertexCacheOptimizer.h"

    namespace niven {
    ///////////////////////////////////////////////////////////////////////////////
    struct OptimizeIndicesProcessor final : public IGeometryStreamProcessor
    {
        OptimizeIndicesProcessor ()
        {
        }

    private:
        bool ProcessChunkImpl (const GeometryStream::Chunk& input,
            GeometryStream::Chunk& output) const
        {
            if (input.GetInfo ().HasIndices ()) {
                const int indexCount = input.GetInfo ().GetIndexCount ();

                HeapArray indices (indexCount);
                std::copy (
                    input.GetIndexDataArrayRef ().begin (),
                    input.GetIndexDataArrayRef ().end (),
                    indices.Get ());

                Geometry::OptimizeVertexCache (MutableArrayRef (indices));

                output = input;

                output.SetIndexData (indices);
            } else {
                output = input;
            }

            return true;
        }
    };

    /**
    ================================================================================
    name = "OptimizeIndices",
    flags = ["None"],
    ui = {
        name = "Optimize indices",
        description =
    [=[# Optimize indices

    Optimizes the indices of an indexed mesh for better vertex cache usage. The input mesh must be already indexed.]=]
    },
    inputs = {
        "Input" = {
            type = "Stream",
            ui = {
                name = "Input file",
                description = "Input file."
            }
        },
            type = "Int"
            ui = {
                description = "Number of threads to use for processing."
            }
            default = 1
        }
    },
    outputs = {
        "Output" = {
            type = "Stream",
            ui = {
                name = "Output file",
                description = "Output file."
            }
        }
    }
    ================================================================================
    */

    /////////////////////////////////////////////////////////////////////////////
    bool OptimizeIndices::ProcessImpl (const Build::ItemContainer& input,
        Build::ItemContainer& output,
        Build::IBuildContext& context)
    {
        const OptimizeIndicesProcessor processor;

        return ProcessGeometryStream (input, output, processor, context);
    }
    } // namespace niven

There’s a tiny boilerplate header for this which declares the methods, but otherwise it’s empty. What do we notice? First, all inputs & outputs are specified right next to the source code using them. In this case, the ProcessGeometryStream method will fetch the input and output streams from the input and output container. All of this is type safe as the declarative types are converted into types used within my framework, and all queries specify the exact type.
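The idea of type-checked queries against a container of boxed values can be approximated in standard C++. The sketch below is not the framework’s `Build::ItemContainer`; `ToolItems`, the `Item` variant and the item names are hypothetical stand-ins, and `std::variant` takes the place of the framework’s boxing.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>
#include <variant>

// Stand-in for boxed values: a closed set of types a tool may receive.
using Item = std::variant<int, double, std::string>;

// Hypothetical container of named tool inputs or outputs.
class ToolItems {
public:
    void Set(const std::string& name, Item value) {
        items_[name] = std::move(value);
    }

    // The caller states the exact type it expects. A missing item throws
    // std::out_of_range, a type mismatch throws std::bad_variant_access.
    template <typename T>
    const T& Get(const std::string& name) const {
        const auto it = items_.find(name);
        if (it == items_.end()) {
            throw std::out_of_range("no such item: " + name);
        }
        return std::get<T>(it->second);
    }

private:
    std::map<std::string, Item> items_;
};
```

A mismatch between the declared type and the queried type then fails loudly at the query instead of silently reinterpreting the value.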

It would also be possible to auto-generate a class which fetches the inputs and casts them into the right type, but that never became enough of a problem. This setup, with the integrated documentation and code, is what I call “least-overhead” coding. Sure, there is still some overhead to set up a build tool which slightly exceeds the amount of code for a command line tool which parses parameters directly, but the overhead is extremely small: some declarative structure and that’s it. In fact, some tools became smaller because the code to load files into streams and error handling was now handled by the build tool framework.

One interesting tidbit is that the tool specifies an IStream, not a concrete implementation. This means I can use, for instance, a memory-backed stream if I compose tools, or read/write to files if the tool is started stand-alone. Previously, the command line tools could only be composed through files, if at all.
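The same composition idea can be shown with the standard library’s stream interfaces standing in for the framework’s IStream. The two tools and `RunPipeline` below are hypothetical examples, chosen only because they are easy to verify.

```cpp
#include <cassert>
#include <cctype>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Each tool reads from a read-only input stream and writes a separate output.
void DropEmptyLines(std::istream& in, std::ostream& out) {
    std::string line;
    while (std::getline(in, line)) {
        if (!line.empty()) {
            out << line << '\n';
        }
    }
}

void Uppercase(std::istream& in, std::ostream& out) {
    char c;
    while (in.get(c)) {
        out.put(static_cast<char>(std::toupper(static_cast<unsigned char>(c))));
    }
}

// Composes both tools through a memory-backed intermediate stream.
std::string RunPipeline(const std::string& text) {
    std::istringstream source(text);
    std::stringstream intermediate;  // never touches the disk
    std::ostringstream result;
    DropEmptyLines(source, intermediate);
    Uppercase(intermediate, result);
    return result.str();
}
```

Because the tools only see `std::istream` and `std::ostream`, the same functions could be handed a `std::ifstream` and a `std::ofstream` when run stand-alone.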

On the other hand, I get the benefits of a common infrastructure. For instance, tool discovery is now easily possible in different formats:

## Conclusion

In hindsight, all of this looks quite obvious, which is good, as it means the new system is easy to explain. However, during development, all of this was a long evolutionary process. At the beginning, I was trying to keep it as simple as possible, with as few libraries, executables and as little boilerplate as possible. Over time, other parts of the framework also evolved (in particular, the boxing of primitive types which integrated them into the common class hierarchy came pretty late) which affected design decisions. Towards the end, I was also taking more and more advantage of the fact that the code was an integral part of my framework.

By tying it closer to the rest of the code base I could drastically cut down the special-case code in the tool library and reap lots of benefits. The downside is that extracting a tool from the framework is now very hard and will require a lot of work. This is the key tradeoff — going “all-in” on a framework may mean you have to live inside it. If done correctly, you can get a lot out of this, and I’m leaning more and more towards more infrastructure on projects. Good infrastructure to “plug” into is where large frameworks like Qt or Unreal Engine 4 shine, even if at the beginning, this means there is a steeper learning curve and more overhead. The key in such an evolution is to strive for the simple and obvious though and not introduce complexity for its own sake.

The other key decision — to move towards state-less, functional building blocks — turned out to be another big winner in the end. The disadvantages in terms of disk usage and sometimes I/O time were more than offset by testability, robustness and the ability to compose full processing pipelines with ease.

## Getting started with D3D12

Welcome to a short introduction to Direct3D 12 (also known as DX12, DirectX12 and D3D12) – the new graphics API from Microsoft, which brings concepts to the table that were introduced with Mantle. These new APIs could be classified as “explicit” APIs, as very few things happen automatically, unlike in previous APIs like Direct3D 11 and OpenGL 4. In this blog post, I’ll introduce the basic concepts behind these new APIs. To follow along, I’d recommend that you check out my tiny D3D12 sample application which illustrates the techniques.

## Some kind of motivation

So why did these new APIs emerge? Let’s start with a motivating example. In D3D11, you can map a buffer for writing and specify the discard flag. That flag is actually a serious problem for the GPU. Let’s assume for a moment that the buffer hasn’t been used yet, and that a frame where it will be used is queued and being processed by the GPU. The driver can’t simply overwrite the buffer in GPU memory because when you submitted the frame, it wasn’t mapped, and time travel is still quite hard.

The driver has only two choices. The naïve one is to simply drain the GPU and wait for it to finish. Performance will be horrible if this happens for every map call, but it will be correct. The right choice is to simply create a new buffer, put the data in there, upload it to the GPU and track the original buffer. Once the frame where the original buffer is used finishes, the original buffer can be recycled and everything is fine. Except the driver now needs to manage a new buffer per map call — tricky, but possible.

If you think that’s just an example, it isn’t. This buffer replacement is called buffer renaming and is a standard technique used by D3D11 drivers. Depending on how large the rename buffer is, and how often buffers are discarded, it can work quite well, but it means there has to be logic in the driver to manage and track this.
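A toy model makes the bookkeeping visible. This is only a sketch of the general renaming idea, not how any particular driver implements it; `RenamedBuffer` and its methods are hypothetical, and frames are identified by increasing numbers starting at 1.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One logical buffer backed by a growing pool of allocations. For every
// allocation we track the last frame that referenced it (0 = never used).
class RenamedBuffer {
public:
    // A draw call in `frame` references the current allocation.
    void MarkUsed(std::uint64_t frame) {
        assert(frame > 0);
        lastUsedFrame_[current_] = frame;
    }

    // Map with discard: if the GPU may still read the current allocation,
    // switch to an idle one (or allocate) instead of draining the GPU.
    std::size_t MapDiscard(std::uint64_t completedFrame) {
        if (lastUsedFrame_[current_] > completedFrame) {
            current_ = FindIdleOrAllocate(completedFrame);
        }
        return current_;
    }

    std::size_t AllocationCount() const { return lastUsedFrame_.size(); }

private:
    std::size_t FindIdleOrAllocate(std::uint64_t completedFrame) {
        for (std::size_t i = 0; i < lastUsedFrame_.size(); ++i) {
            if (lastUsedFrame_[i] <= completedFrame) {
                return i;  // the GPU is done with this one, recycle it
            }
        }
        lastUsedFrame_.push_back(0);
        return lastUsedFrame_.size() - 1;
    }

    std::vector<std::uint64_t> lastUsedFrame_ = {0};
    std::size_t current_ = 0;
};
```

Mapping while frame 1 is still in flight yields a second allocation; once frame 1 has completed, the first allocation is recycled and the pool stops growing.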

## Going explicit

With D3D12, these things go away, and the developer is now directly exposed to memory management and synchronization. What does this mean exactly? Well, for starters, tracking of resources has to be done by the developer. If you look into my sample, you’ll notice I create “frame fences” which allow me to check if a frame has finished. For the constant buffers, I have one constant buffer per queued frame in a poor man’s ring buffer. Using the frame fence, I can synchronize with the GPU while still allowing the GPU queue to fill up. This removes the need for rename buffers from the driver.

Memory management is now also explicit; for instance, uploading no longer happens “under the hood”. You’ll notice that I use two kinds of resources: Static data like the vertex and index buffer as well as the texture, and dynamic data like the constant buffer. For the dynamic data, which is read only once, it doesn’t make too much sense to push it to the GPU at all. In my sample, I hence place the constant buffer in CPU memory and let the GPU read that directly. In D3D11, the driver has to guess how often a buffer will be read and where to place it, but in D3D12, I can use the knowledge I have about my access patterns to optimize this.

The other data needs to be uploaded, and unlike D3D11 where this happens automatically, I have to do this on my own. Which means I need to reserve space on the CPU from where to stage the update, allocate some GPU memory, issue a copy and wait for it to finish before I use the resource. In the small sample, you can see that I wait for it to finish manually and hence keep everything deterministic but in a larger application I could take advantage of the copy queue and copy data independent of the rendering. This makes it easy to implement advanced streaming which was very hard to do before, as the driver can’t predict when a resource has to be resident on the GPU.

## Resource state tracking

Another completely new responsibility for developers is state tracking. In D3D11, resources transition between states automatically which can lead to bad performance. Imagine the following scenario: Four shadow maps are rendered and applied onto the scene. The application renders into a shadow map, changes the target, renders into the next and so on and then finally loops over the four shadow maps and reads them. What you may not know is that GPUs compress depth data to improve bandwidth and eventually performance, but the texture units may not be able to read that compressed data directly and hence require a decompression. This decompression can potentially require a flush and wait-for-idle to make sure that the compressed data is written completely and no longer used before it gets decompressed.

Now, if the driver is not careful, this could result in a decompress, flush, read cycle, four times. The reason for this is that the driver only notices that the decompression is needed when it sees that the resource is bound for reading. With D3D12, these transitions are now explicit and the developer can schedule them. In the example above, they can choose to decompress all four shadow maps at once in a single transition, pay the cost for the flush once and improve performance.
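The difference can be expressed as a simple cost model. The sketch assumes, as the text does, that each barrier submission costs one flush regardless of how many resources it covers; real hardware may behave differently. `GpuModel` and both counting functions are hypothetical. In actual D3D12 code, the batched variant corresponds to passing several barriers to a single `ID3D12GraphicsCommandList::ResourceBarrier` call.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class State { DepthWrite, ShaderRead };

struct GpuModel {
    int flushes = 0;

    // One barrier submission: a single flush covers the whole batch.
    void Transition(const std::vector<State*>& batch, State target) {
        bool anyChange = false;
        for (State* resource : batch) {
            if (*resource != target) {
                *resource = target;
                anyChange = true;
            }
        }
        if (anyChange) {
            ++flushes;
        }
    }
};

// Lazy, driver-style behavior: each map is transitioned when it is first read.
int FlushesLazy(int mapCount) {
    assert(mapCount > 0);
    GpuModel gpu;
    std::vector<State> maps(static_cast<std::size_t>(mapCount), State::DepthWrite);
    for (State& map : maps) {
        gpu.Transition({&map}, State::ShaderRead);
    }
    return gpu.flushes;
}

// Explicit behavior: all maps are transitioned in one batch before reading.
int FlushesBatched(int mapCount) {
    assert(mapCount > 0);
    GpuModel gpu;
    std::vector<State> maps(static_cast<std::size_t>(mapCount), State::DepthWrite);
    std::vector<State*> all;
    for (State& map : maps) {
        all.push_back(&map);
    }
    gpu.Transition(all, State::ShaderRead);
    return gpu.flushes;
}
```

For the four shadow maps from the example, the lazy variant counts four flushes and the batched variant counts one.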

Another big area where the D3D11 driver spends time is setting and validating state. For instance, let’s assume you set a vertex and a pixel shader. The driver must check that the signatures of both match and this can only happen at draw time because the driver cannot precompute all permutations of vertex and pixel shaders to look this up. Often, the driver will even delay the compilation of a shader until it is used for the first time to improve startup time and easily skip unused shaders. Games often have to “pre-warm” the driver shader cache by touching all combinations once during loading to ensure that the gameplay doesn’t get interrupted when the driver starts to compile a shader.

In D3D12, this changes completely with the introduction of pipeline state objects which group all shaders and quite a bit of rendering state together. Grouping this data allows the driver to validate everything once and at runtime just swap the state without any further checks. It also means the driver can check if the pixel shader output is used at all and optimize the shader if some data is going to be discarded anyway. This is a huge change from previous APIs, and is also a major pain point when transitioning legacy engines which tend to identify the required combinations at run-time. In the D3D12 world, the shaders need to become part of the asset pipeline. In the sample, you can see how much state actually goes into the pipeline state object, even for a rather simple shader setup.
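The “validate once at creation” idea can be sketched independently of the real API. Everything below is a hypothetical model: shader signatures are reduced to sets of semantic names, and `CreatePipelineState` only checks that the vertex shader provides every input the pixel shader consumes.

```cpp
#include <algorithm>
#include <cassert>
#include <optional>
#include <set>
#include <string>

// Hypothetical, heavily reduced pipeline description.
struct PipelineDesc {
    std::set<std::string> vertexShaderOutputs;
    std::set<std::string> pixelShaderInputs;
};

// A successfully created state object is known to be consistent.
struct PipelineState {
    std::set<std::string> linkedSemantics;
};

// All validation happens here, once. Binding the result at draw time needs
// no further signature checks.
std::optional<PipelineState> CreatePipelineState(const PipelineDesc& desc) {
    const bool signaturesMatch = std::includes(
        desc.vertexShaderOutputs.begin(), desc.vertexShaderOutputs.end(),
        desc.pixelShaderInputs.begin(), desc.pixelShaderInputs.end());
    if (!signaturesMatch) {
        return std::nullopt;
    }
    return PipelineState{desc.pixelShaderInputs};
}
```

A mismatching combination is rejected when the object is built, which is the point at which an asset pipeline can report it, long before the first draw call.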

## Resource binding

Finally, resource binding in the D3D12 world is totally different from D3D11. Legacy APIs tend to model the GPU as something I call the “slot machine”. You have lots of different slots where you plug in textures, samplers, etc. This used to be how hardware worked, but it hasn’t been true for several years. If you look for instance at the GCN ISA documentation, specifically for “image resources”, you’ll notice that there is no “sampler slot” or “texture slot” being used there. Instead, the texture and sampler descriptors are loaded into a bunch of registers and that’s it. This new model is what is used by D3D12 through the root signature and descriptor tables.

The root signature serves as the first indirection level for resource bindings. It can contain some data in-line if it is small enough — for instance, a pointer to memory (also known as a constant buffer) or a few floats, or pointers to descriptor tables that can contain larger descriptors (for instance, texture descriptors.)

It is interesting that the root signature is still tracked with renaming, but as it is generally very small, this is not a huge problem (for best performance, it should be small and some other rules should be followed as well; check out this GDC 2015 presentation on D3D12 for details.) In the sample, you can see how the texture descriptor is placed in such a table and then referenced from the root signature. Again, the goal here is to allow changing large amounts of bindings very quickly. Unlike in D3D11, where the developer changes slots and the driver needs to map them to descriptors and build the table on demand, the developer can now swap for instance all textures and samplers required by a material by updating the descriptor table pointers in the root signature, which is a very cheap and fast operation.
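The indirection can be modeled in a few lines. This is a conceptual sketch with hypothetical types, where a descriptor is just a name; in real D3D12 code, the table pointer is set with `ID3D12GraphicsCommandList::SetGraphicsRootDescriptorTable`.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-ins: a heap of prebuilt descriptors and the root
// argument that points at the start of the active table.
struct DescriptorHeap {
    std::vector<std::string> descriptors;
};

struct RootArguments {
    std::size_t tableBase = 0;
};

// What a shader sees for a given table-relative slot.
const std::string& Resolve(const DescriptorHeap& heap,
                           const RootArguments& root,
                           std::size_t slot) {
    assert(root.tableBase + slot < heap.descriptors.size());
    return heap.descriptors[root.tableBase + slot];
}
```

Switching from one material to another only changes `tableBase`; none of the descriptors themselves are rewritten.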

## Things we didn’t look at

D3D12 also comes with explicit command buffers which allow multiple CPU threads to record commands. I’m not covering this here as the sample doesn’t take advantage of multiple threads; maybe some other time. I’m also not covering the different queues exposed by D3D12 today. In D3D12, it is possible to execute a compute shader concurrently with draw calls and data transfers by taking advantage of the graphics, compute and copy queues. This is again an advanced feature which is not a good fit for an introductory post.