Live OpenCL application profiling with AMD's GPUPerfAPI

Ever wanted to know what the actual performance of your OpenCL application is? Estimating memory bandwidth by guessing the number of bytes your kernel reads and writes and dividing this by the overall time? Trying to get a grip on cache behaviour?

If you answered any of these questions with yes, then this blog post is for you. I'm going to explain how you can add live profiling to your application using AMD's GPUPerfAPI, which will allow you to tune and profile your application at the same time; and with hot-loading of OpenCL kernels, potentially the fastest update-profile loop you've ever seen.

Background

On the GPU, estimating performance is just as hard as on CPUs. For instance, GPUs come with multiple caches and complex memory controllers, making it really hard to estimate how much data actually has to be read to execute a kernel. I've recently been helping to instrument a complex numerical solver at work, and it turned out that the total memory bandwidth estimate was off by a factor of 10. Not because the estimate was particularly sloppy, but because the actual cache hit rates and cache effects are very hard to predict without resorting to extremely complex (and slow) simulators, which are often not feasible at all for the very large data sets you want to process on a GPU.

Fortunately, the hardware can help us in this situation. Modern GPUs have hardware performance counters, similar to CPUs, which allow for precise measurement. Typically, these are not exposed (or only to IHV-specific tools), but at least on AMD, you can in fact read them using the excellent GPUPerfAPI library. Hardware performance counters are also interesting because they have nearly zero overhead and require no changes to your code, which makes them non-intrusive.

The GPUPerfAPI provides access to the hardware counters for any application. For example, with this API, you can read:

  • FetchSize: The amount of data actually read from GDDR memory by the kernel. This is the memory traffic your kernel really generated, which can be drastically lower than your estimate due to caching.
  • CacheHit: The cache hit rate.
  • VALUUtilization: The coherency in the vector unit.

With these values alone, you can already get a very precise understanding of how effective your kernel is. Besides those three, the API provides lots of additional counters; check out the documentation for details.

Getting started

The GPUPerfAPI is available for both Windows and Linux. On Windows, you can use it to profile OpenCL, OpenGL, Direct3D 10 and 11, and on Linux, for OpenCL & OpenGL. Initialization is a bit tricky, as there is no static library to link against (and you wouldn't want to link against one anyway.)

What I did was write a simple wrapper which opens the DLL/SO and then dynamically loads each function. This is very easy to do; in fact, there is a special function types header which contains all the function type definitions, which would otherwise be pretty tedious to write. You can find a drop of my current wrapper library at Bitbucket. As of today, it's pretty basic and likely to contain a few bugs, but it should give you a good starting point nonetheless. Before you can do anything with the API, you have to initialize it by calling GPA_Initialize followed by GPA_OpenContext. For OpenCL, simply pass the command queue to GPA_OpenContext. Now you are ready to go.
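
Here is a minimal sketch of what such a wrapper could look like on Linux. The library file name and the function-pointer typedefs are assumptions based on the GPUPerfAPI headers of that time (they normally come from the function types header mentioned above), so adjust them to your SDK version.

```cpp
// Minimal sketch: dynamically load GPUPerfAPI and initialize it for OpenCL.
#include <dlfcn.h>
#include <stdexcept>

// These typedefs normally come from the GPUPerfAPI function types header;
// shown inline here only to keep the sketch self-contained.
typedef int GPA_Status;                        // 0 == GPA_STATUS_OK
typedef GPA_Status (*GPA_InitializeFn)  ();
typedef GPA_Status (*GPA_OpenContextFn) (void* context);

struct GpaApi
{
    void*             library     = nullptr;
    GPA_InitializeFn  Initialize  = nullptr;
    GPA_OpenContextFn OpenContext = nullptr;
};

GpaApi LoadGpaForOpenCL ()
{
    GpaApi api;
    // The OpenCL flavour of the library; the exact file name depends on the release.
    api.library = dlopen ("libGPUPerfAPICL.so", RTLD_NOW);
    if (!api.library) throw std::runtime_error ("GPUPerfAPI not found");

    api.Initialize  = reinterpret_cast<GPA_InitializeFn>  (dlsym (api.library, "GPA_Initialize"));
    api.OpenContext = reinterpret_cast<GPA_OpenContextFn> (dlsym (api.library, "GPA_OpenContext"));
    return api;
}

// Usage: initialize, then pass the OpenCL command queue as the context.
//   GpaApi gpa = LoadGpaForOpenCL ();
//   gpa.Initialize ();
//   gpa.OpenContext (commandQueue);   // commandQueue is your cl_command_queue
```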

Profiling itself consists of three steps:

  1. Querying and enabling counters
  2. Gathering data
  3. Retrieving results

To query the available counters, you first have to get the number of available counters using GPA_GetNumCounters. You can then enumerate and query them using GPA_GetCounterName, GPA_GetCounterDataType and so forth. Once you have found the counters you'd like to read, enable them using GPA_EnableCounter (which takes an index into the initial counter list) or GPA_EnableCounterStr (which accepts the counter name.)
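
A short sketch of that setup: enumerate everything the device exposes, then enable the three counters discussed above by name. The prototypes below mirror the GPUPerfAPI headers of that era (in my wrapper they are resolved at runtime), so treat the exact signatures as assumptions.

```cpp
#include <cstdint>
#include <cstdio>

extern "C" {
    typedef int GPA_Status;   // 0 == GPA_STATUS_OK
    GPA_Status GPA_GetNumCounters   (std::uint32_t* count);
    GPA_Status GPA_GetCounterName   (std::uint32_t index, const char** name);
    GPA_Status GPA_EnableCounterStr (const char* counter);
}

void EnableCounters ()
{
    std::uint32_t counterCount = 0;
    GPA_GetNumCounters (&counterCount);

    // List everything the current device exposes; useful the first time around.
    for (std::uint32_t i = 0; i < counterCount; ++i) {
        const char* name = nullptr;
        GPA_GetCounterName (i, &name);
        std::printf ("Counter %u: %s\n", i, name);
    }

    // Enable the counters discussed above by name.
    GPA_EnableCounterStr ("FetchSize");
    GPA_EnableCounterStr ("CacheHit");
    GPA_EnableCounterStr ("VALUUtilization");
}
```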

You're now set to query data. Depending on how many counters you have activated, you might need multiple passes (i.e., multiple kernel executions) to read them all. Once the counters are enabled, call GPA_GetPassCount to get the number of passes required. Gathering results itself is straightforward (see the sketch after this list):

  • Call GPA_BeginSession to start a profile section. Save the session id to retrieve the results of this section later.
  • For each pass:

    • Call GPA_BeginPass
    • Call GPA_BeginSample -- you can have more than one sample per pass; for my current code, I simply use a single sample, always.
    • Run your kernel
    • Call GPA_EndSample, GPA_EndPass
  • Call GPA_EndSession
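
The loop above, spelled out as code: one session, the required number of passes, and a single sample per pass. RunKernel() stands in for whatever enqueues and finishes your OpenCL kernel; as before, the GPA_* prototypes follow the headers of that era and are assumptions in the details.

```cpp
#include <cstdint>

extern "C" {
    typedef int GPA_Status;
    GPA_Status GPA_GetPassCount (std::uint32_t* count);
    GPA_Status GPA_BeginSession (std::uint32_t* sessionId);
    GPA_Status GPA_EndSession   ();
    GPA_Status GPA_BeginPass    ();
    GPA_Status GPA_EndPass      ();
    GPA_Status GPA_BeginSample  (std::uint32_t sampleId);
    GPA_Status GPA_EndSample    ();
}

void RunKernel (); // your kernel dispatch, including clFinish on the queue

std::uint32_t ProfileKernel ()
{
    std::uint32_t passCount = 0, sessionId = 0;
    GPA_GetPassCount (&passCount);

    GPA_BeginSession (&sessionId);
    for (std::uint32_t pass = 0; pass < passCount; ++pass) {
        GPA_BeginPass ();
        GPA_BeginSample (0);   // one sample per pass is enough here
        RunKernel ();          // the kernel runs once per pass
        GPA_EndSample ();
        GPA_EndPass ();
    }
    GPA_EndSession ();

    return sessionId;          // needed later to fetch the results
}
```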

Now the results are somewhere, but not on the CPU yet. If you want to get maximum profiling efficiency, you can run multiple profile sessions before you start fetching the results. To query if the results are ready, use GPA_IsSessionReady; I have a blocking version of this in my code which simply calls it in a loop until results are ready.

Once you have the results, you have to re-associate them with the counters. This is straightforward as well (again, a sketch follows the list):

  • Get the number of active counters using GPA_GetEnabledCount
  • Enumerate them and query the original index of the counter using GPA_GetEnabledIndex
  • You can now query the counter details again using GPA_GetCounterName, etc., or you just store the indices from the first enumeration and map them now
  • Retrieve the data using GPA_GetSampleUInt32, GPA_GetSampleUInt64, GPA_GetSampleFloat32, GPA_GetSampleFloat64
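
Putting the waiting and the retrieval together, a sketch could look like the following. All values are read as float64 here for brevity; in a real application you would switch on GPA_GetCounterDataType. As above, the exact prototypes are assumptions.

```cpp
#include <cstdint>
#include <cstdio>

extern "C" {
    typedef int GPA_Status;
    GPA_Status GPA_IsSessionReady   (bool* ready, std::uint32_t sessionId);
    GPA_Status GPA_GetEnabledCount  (std::uint32_t* count);
    GPA_Status GPA_GetEnabledIndex  (std::uint32_t enabledNumber, std::uint32_t* counterIndex);
    GPA_Status GPA_GetCounterName   (std::uint32_t counterIndex, const char** name);
    GPA_Status GPA_GetSampleFloat64 (std::uint32_t sessionId, std::uint32_t sampleId,
                                     std::uint32_t counterIndex, double* result);
}

void PrintResults (const std::uint32_t sessionId)
{
    // Blocking wait -- the simple approach mentioned above.
    bool ready = false;
    while (!ready)
        GPA_IsSessionReady (&ready, sessionId);

    std::uint32_t enabledCount = 0;
    GPA_GetEnabledCount (&enabledCount);

    for (std::uint32_t i = 0; i < enabledCount; ++i) {
        // Map the i-th enabled counter back to its original index ...
        std::uint32_t counterIndex = 0;
        GPA_GetEnabledIndex (i, &counterIndex);

        const char* name = nullptr;
        GPA_GetCounterName (counterIndex, &name);

        // ... and read the value of sample 0 from the session.
        double value = 0;
        GPA_GetSampleFloat64 (sessionId, 0, counterIndex, &value);
        std::printf ("%s = %f\n", name, value);
    }
}
```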

And that's it!

Results

Here are a few results from my ray-tracer. I've been measuring the cache hit rate as well as the amount of data that needs to be read for each frame. You can see it in the top-left corner. Having this data "on screen" is much more useful than running it in CodeXL and having it available post-run only, as I can now immediately correlate it with a particular view and experiment.

Far away view, low cache hit ratio.
Cache hit rate increases as moving closer to the mesh.
Up close, the cache hit ratio peaks out.

This is pretty amazing, as for the first time, I can actually get precise readings from the hardware for data which is otherwise very hard to obtain. I've previously written a SIMD simulator to estimate the numbers you see above, but compared to the real hardware data, an error of 50% is normal for such a simulator.

The profiling also adds nearly no overhead: as you can see above, the rendering, including profiling, runs in less than 5 ms per frame. It's a no-brainer to add to any existing OpenCL application, and if you're serious about performance, you should integrate this right away -- you'll be amazed what the actual cache hit and coherency rates are!

Note for HD 7970 users and Catalyst 14.4

If you are using a Southern Islands card (7970 and similar), there is a known issue with the 14.4 Catalyst which will send you straight to blue-screen county. Just downgrade to a previous driver or wait for a future one. On newer hardware (R9 290), everything works fine. I ran into this issue when trying out the API initially :) Thanks Chris!

[Update] The issue has been resolved: with Catalyst 14.6, everything works as expected again on the HD 7970.

Robust OpenCL initialization, part #2 (Optimus & friends)

I totally forgot about it, but there is one thing related to robust OpenCL initialization which is difficult, if not impossible, to solve robustly, and that is handling hybrid graphics. This blog post is specific to NVIDIA/Intel setups (Optimus), which are a very common configuration these days, and the problem will only affect you if you want to use graphics & compute interop.

The problem that you will run into is easy to explain:

  • The D3D device is created on the Intel integrated chip (device #0 if you enumerate them), which is (potentially) re-routed to the NVIDIA driver due to Optimus
  • OpenCL interop will not be aware of this, so the Intel OpenCL runtime will try to interop with a hijacked Intel driver, and fail

If Optimus is disabled (i.e. there is no NVIDIA graphics adapter), everything will work fine. Similarly, it'll just work if you have the integrated chip disabled. The problems only crop up if both devices are active and available.

Unfortunately, I don't have a good solution for this problem. The most robust way seems to be to enumerate all devices and prefer NVIDIA over Intel, which may not be what you want (especially if the user asked for the integrated device.) Ideally, you'd have some query to check if Optimus is present and whether it should be used for your application, and only then use NVIDIA, but so far, I haven't found a way to do this (if you know a solution, please drop a line in the comments!)
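
For reference, a minimal sketch of that "prefer NVIDIA over Intel" heuristic: enumerate all OpenCL platforms and pick the NVIDIA one if present, otherwise fall back to whatever comes first. Real code should also honour an explicit user choice and check that the platform actually exposes devices.

```cpp
#include <CL/cl.h>
#include <string>
#include <vector>

cl_platform_id PickInteropPlatform ()
{
    cl_uint count = 0;
    clGetPlatformIDs (0, nullptr, &count);

    std::vector<cl_platform_id> platforms (count);
    clGetPlatformIDs (count, platforms.data (), nullptr);

    cl_platform_id fallback = count ? platforms [0] : nullptr;
    for (cl_platform_id platform : platforms) {
        char vendor [256] = {};
        clGetPlatformInfo (platform, CL_PLATFORM_VENDOR, sizeof (vendor), vendor, nullptr);
        if (std::string (vendor).find ("NVIDIA") != std::string::npos)
            return platform;   // Optimus: the graphics device ends up on NVIDIA
    }
    return fallback;
}
```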

I'm not sure what happens with AMD's equivalent (Enduro), but I would assume that it'll be similarly complicated. If you know it, please tell me so I can update this post accordingly!

Robust OpenCL initialization, part #1

Initializing and using OpenCL is a bit tricky, in particular on Windows and Linux, which don't come with OpenCL installed out of the box. Unlike OpenGL, which is always present, you cannot simply link against OpenCL and expect your application to even start, as OpenCL may not be present on the target machine at all. If you plan to ship an application which uses OpenCL, you'll need a robust way to detect whether OpenCL is present and usable.

OpenCL initialization problems & solutions

As mentioned above, the first problem we have to solve is to find out whether OpenCL is present at all. The problem is that when you develop an OpenCL application, you typically work with a stub library. This stub library is the ICD, or Installable Client Driver, a small library responsible for dispatching the function calls to the OpenCL implementations. When your application starts up, the ICD searches for installed OpenCL platforms and loads them on your behalf. That's the theory at least. In practice, you'll run into two separate problems with the ICD: It may not be present at all, and if it is present, it may have the wrong version.

The first one is easy to understand: on a fresh installation of Windows, for example, there is no ICD at all, so your application will fail to load it and most likely crash right away. The second problem is similar: your application will start loading, but crash due to a missing import.

Fortunately, there is a good solution for the ICD problem. Instead of relying on a pre-installed ICD, you can ship the Khronos ICD with your application (the license specifically allows redistributing the binary without restriction.) It's a tiny library, built with CMake, which provides the complete OpenCL 1.2 ICD loader. By using this, you can at least guarantee that the ICD will be present on the target machine.

That brings us to the second problem, which is determining whether the OpenCL implementation actually works. Even if the ICD is present, various things may still fail. The driver configuration can be broken, the driver may not be installed properly, or initializing OpenCL may simply crash. While these issues are less common, you'll still want to guard against them.

There's only one really good solution I can think of, which is also used by Adobe, as far as I know. What you do is start a "test" process which tries to initialize an OpenCL context, and check the return value of this process. If it exits normally and returns success, you consider OpenCL to be available and then load your own compute library which is linked directly against OpenCL. If the process fails, you fall back to another implementation.
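
A sketch of such a test process: a tiny executable that does nothing but try to create an OpenCL context and report success via its exit code. The main application runs it first and only loads its OpenCL-linked module if the exit code is 0.

```cpp
#include <CL/cl.h>

int main ()
{
    cl_uint platformCount = 0;
    if (clGetPlatformIDs (0, nullptr, &platformCount) != CL_SUCCESS || platformCount == 0)
        return 1;   // no usable platform

    cl_platform_id platform = nullptr;
    clGetPlatformIDs (1, &platform, nullptr);

    cl_device_id device = nullptr;
    if (clGetDeviceIDs (platform, CL_DEVICE_TYPE_ALL, 1, &device, nullptr) != CL_SUCCESS)
        return 1;   // no device on the platform

    cl_int error = CL_SUCCESS;
    cl_context context = clCreateContext (nullptr, 1, &device, nullptr, nullptr, &error);
    if (error != CL_SUCCESS)
        return 1;   // context creation failed (broken driver, etc.)

    clReleaseContext (context);
    return 0;       // OpenCL looks usable on this machine
}
```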

Conclusion

For the most robust OpenCL initialization, you'll want to use both solutions outlined above:

  • Ship your own ICD: This guarantees that you have a working ICD, and that this ICD has the version you expect.
  • Run a check process to test the waters: This will safeguard you against incomplete driver installations and other problems.

It's also a good idea to check the driver version reported by the OpenCL platform. If you know that some bug got fixed in a particular driver version, I would recommend having a simple whitelist of vendors and driver versions, and just using your fallback path if you find a known bad driver version.
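
As an illustration, a known-bad-driver check could look like this: read the vendor and driver version strings reported by OpenCL and compare them against a small list. The entries below are made up for illustration only; maintain your own list based on the bugs you actually hit.

```cpp
#include <CL/cl.h>
#include <string>
#include <utility>
#include <vector>

bool IsKnownBadDriver (cl_device_id device)
{
    char vendor [256] = {}, driver [256] = {};
    clGetDeviceInfo (device, CL_DEVICE_VENDOR,  sizeof (vendor), vendor, nullptr);
    clGetDeviceInfo (device, CL_DRIVER_VERSION, sizeof (driver), driver, nullptr);

    // Hypothetical example entries: vendor substring + driver version substring.
    static const std::vector<std::pair<std::string, std::string>> knownBad = {
        { "Advanced Micro Devices", "1445.5" }   // placeholder, not a real entry
    };

    for (const auto& entry : knownBad) {
        if (std::string (vendor).find (entry.first)  != std::string::npos &&
            std::string (driver).find (entry.second) != std::string::npos)
            return true;   // use the fallback path instead
    }
    return false;
}
```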

In practice, you'll always find an ICD, as AMD, Intel and NVIDIA all ship OpenCL with their drivers. The most common problem you'll encounter is an outdated ICD (NVIDIA ships with OpenCL 1.1 only, and that includes their ICD) or a broken driver.

In case you have any questions, feel free to comment or send me a mail.

[Update]: For Optimus & other hybrid graphics initialization woes, make sure to check out part #2.

My GDC 2014 talk on OpenCL and realtime graphics

This post is for everyone who just grabbed the slides of my "OpenCL for realtime graphics" talk from GDC and wonders where the speaker notes are. So here we go :) Thanks again to Andrew & Neil for inviting me!

Interop API

Every time I mention OpenCL and graphics, people start by complaining about interop, how much code it is, and how slow it is going to be. The truth is, the interop code itself is going to be really small, and there is no good reason why the interop has to be slower than DirectCompute or OpenGL compute shaders (or at least, I'm not aware of one.) In fact, because you have to tell the driver early on that you're going to use a resource for compute later, it can make better decisions than in DirectCompute, where it can only identify the dependencies at dispatch time. The only reason why performance with OpenCL may be worse is missing integration of OpenCL with the Direct3D/OpenGL drivers. Fundamentally, there is nothing in the specification or elsewhere which makes OpenCL/graphics interop inefficient.

Tile based deferred shading

I did that demo back in 2011, so some things may have changed since then -- in particular, undefined behaviour is undefined; you've been warned. It's really unfortunate that read/write images didn't make it into OpenCL 1.x, as they are supported by all Direct3D 11 compliant hardware.

I also mentioned in the slides that you can't map the depth buffer directly. The reason why you really want to do this is that it allows you to take advantage of the depth buffer compression. As a workaround I simply copied it into a separate texture, which, as far as I know, will trigger a decompression on current hardware. It's not going to make things horribly slow, but it's still nicer if you can just directly map the buffers and avoid yet another GPU memory copy.

Computational graphics

That's the "serious" compute stuff, where compute is used for large volume data visualization. In this case, the problem you typically run into is that GPU memory is fairly limited compared to CPU memory. The problem that everyone has in mind here is "portable performance", or basically the question: Can we have stuff which is efficient on both the GPU and CPU?

In my experience, the answer is yes, you can. You might not get the maximum performance on the CPU if you optimize for GPU first, but you're still going to be reasonably efficient. The 30% compared to ISPC in the slides is just a ballpark number, but what you have to keep in mind is that the ISPC path is using a different traversal code which is more suited for CPUs, and I've spent significant amounts of time optimizing it. The CPU OpenCL version in comparison was barely tweaked until it ran "fast enough". In the future, once CPUs ship with scatter & gather instructions, we can expect the gap to close significantly. Absolute performance will probably still differ, but efficiency on CPUs will go up, and by using OpenCL, your code will be able to take advantage of these improvements immediately without having to do anything on your side.

Shipping products

The key takeaways here are:

  • You can ship software relying on OpenCL/graphics interop now
  • You should contact your IHV early

Basically, there's no real blocker preventing you from shipping applications using OpenCL. People have drivers which support it, compilers are robust enough, and the interop works on Windows & Linux. Sure, there are issues from time to time, but nothing is horribly broken, and most of the time problems can be worked around.

One issue we have been running into regularly is GPU memory management, which is not as robust as we'd like it to be. Graphics developers will be familiar with this problem; for compute, it's even worse as buffers tend to be larger there.

Regarding the stricter implementation: AMD's compiler tends to follow the specification very closely, similar to their OpenGL compiler, which is also very strict. For instance, it'll catch implicit double to float conversions, even if you just want to convert 1.0. We typically develop with all warnings turned on and all warnings treated as errors. You do have to test on different machines regularly though, as even standard-compliant code sometimes results in problems. For example, we ran into issues where one compiler would simply fail on nested loops; in such cases, you have to bite the bullet and restructure your code. That said, these issues are getting increasingly rare.
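
To make the double-to-float point concrete, here is a hypothetical kernel line of the kind a strict compiler will flag, shown as a C++ string as it would be passed to clCreateProgramWithSource:

```cpp
// The literal 1.0 is a double, so initializing a float from it involves an
// implicit narrowing conversion; strict OpenCL compilers may warn or error.
// Writing 1.0f avoids the diagnostic everywhere.
const char* const kernelSource = R"CLC(
kernel void scale (global float* data)
{
    const float factor = 1.0;     /* implicit double -> float */
    /* const float factor = 1.0f;    <- fine everywhere */
    data [get_global_id (0)] *= factor;
}
)CLC";
```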

Conclusion

To sum it up: OpenCL is ready for graphics, but there are various issues outside of OpenCL which have hindered its success so far. I expect this to change once OpenCL 2.0 is available, as it'll provide capabilities going far beyond OpenGL compute shaders and DirectCompute, and improve the graphics interop capabilities at the same time. For games, OpenCL 2.0 is likely to be a good target.

However, if you want to use OpenCL for your content creation tools, OpenCL 1.1/1.2 is ready & good to go. For example, I don't see why you would want to write a ray-tracer for light map baking in DirectCompute if you can get both the interactive version and the offline version for free by using OpenCL. Or write code twice, just because you have to support OpenGL and Direct3D.

So that's it, I hope you enjoyed the talk! If you have questions, just use the comment function or contact me directly.

Porting from Windows to Linux, part 3

Welcome to the final post of my porting from Windows to Linux series. At this point, I assume you have built your application on Linux. If not, keep on porting :) Otherwise, read on for hints on how to keep your Linux and Windows build working, and how to take advantage of the fact that your code is working on Linux now.

Code quality tools

Having your code on Linux means you also get access to Linux-specific tools. There are three tools I want to talk about which you should know:

  • Clang
  • Valgrind
  • Address/Thread Sanitizer

Let's start with Clang. Clang is a production-quality C++ compiler built on top of the excellent LLVM framework. It has full C++11 support, provides useful error messages and compiles faster than any other compiler I'm aware of. For portability, it is always useful to check against more than one compiler, and once you have ported over to Linux, using Clang for compiling is trivial. If you use CMake, you can just specify the native compiler to be clang++ instead of g++ and you're ready to go. You can expect roughly 20-30% faster compile times.

Valgrind is a memory validation tool (it can also be used for profiling.) Basically, Valgrind runs your application and interprets every instruction. At the end, it can report memory leaks, uninitialized memory reads and other memory errors with complete call stacks. The downside is that your code is going to run at least one order of magnitude slower. This makes it not that useful for testing complete applications, but it's great for your unit and functional test suites! While Valgrind cannot detect all errors, it's the tool for detecting heap memory leaks and other memory issues.

As said above, Valgrind makes your code at least 10x slower. This has (recently) led to the development of new sanitization tools like the address sanitizer. Instead of executing the whole code in a virtual machine, the address sanitizer instruments load/store instructions. The compiler also has to cooperate, but fortunately, both Clang and GCC support the address sanitizer. Due to the instrumented code, the resulting binary is much faster than running it under Valgrind. The only downside is that it needs a recompilation -- you'll basically get a new build configuration for extended debugging. I haven't used it enough on my code to talk about results yet, but if you ask whether it's worth it, I'd suggest you look at the list of bugs found using address sanitizer. Similarly, there is also a new thread sanitizer which will hopefully be just as useful for thread synchronization bugs.

Build automation

If your build is not fully automated, you should definitely be thinking about this now. As mentioned above, Linux provides several great tools which can be used to automatically check your code, but these tools are only useful if they run regularly. Moreover, once you start supporting Clang and GCC on Linux, and MSVC on Windows simultaneously, chances are high that one of the configurations will break at some point.

The way to combat this is to rely on build automation. You might think that you have enough people working on the code to catch all issues, but that's not enough. With Clang, GCC, MSVC, debug, release, Valgrind and the address sanitizer, you already have a build matrix consisting of 14 possible configurations -- some of which are a pain to work with, like a debug build under Valgrind. Take your time and set up a server which builds all configurations overnight. Ideally, your nightly builds will also execute the unit test suite, so you can catch both compile errors and runtime errors. If you need a cross-platform build tool, I can recommend Buildbot, which does its job and is reasonably easy to configure.

This concludes the porting to Linux series! If you still have questions, feel free to comment and I'll try my best to address them.