
More compute shaders

Last week I covered compute shaders, and I was asked to go a bit deeper on the hardware side to cover subgroups and more. But before we get there, let's briefly recap what a compute unit looks like and what happens inside it.

In the last post, I explained that the hardware is optimized for many items executing the same program. This resulted in very wide SIMD units (16-wide in the case of AMD's GCN), memory latency hidden through task switching, and a branching model which relies on masking. I didn't go into much detail about those, so let's do that now -- and afterwards, we'll find out what to do with our new-found knowledge!

SIMD execution

When it comes to executing your code, a GCN compute unit has two main building blocks you care about: Several SIMD units and a scalar unit. The SIMD units are each 16 wide, meaning they process 16 elements at a time. However, they don't have a latency of one -- they don't finish executing an instruction in a single clock cycle. Instead, it takes four cycles to process an instruction from start to end (some take longer, but let's pretend all take four cycles for now.) Four cycles is about the speed you'd expect for something like a fused-multiply-add, which needs to fetch three operands from the register file, do a multiply-and-add, and write the result back (highly optimized CPU designs also take four cycles, as can be seen in Agner Fog's instruction tables.)

Latency doesn't equal throughput, though. An instruction with a latency of four can still have a throughput of one, if it's properly pipelined. Let's look at an example:

Comparison of pipelined and non-pipelined execution.
A non-pipelined unit compared to a pipelined execution unit. Latency is four cycles in both cases, but once filled, the pipelined unit has a throughput of one instruction per cycle.
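To make the difference concrete, here's a tiny Python model -- purely illustrative, not tied to any real hardware -- counting the cycles needed to retire a batch of independent instructions in both designs:

```python
# Toy model: cycles to retire n independent instructions through a unit
# with a fixed 4-cycle latency, non-pipelined vs. pipelined.

def cycles_non_pipelined(n, latency=4):
    # Each instruction must fully drain before the next one can start.
    return n * latency

def cycles_pipelined(n, latency=4):
    # Once the pipeline is filled (latency cycles), one instruction
    # retires every cycle.
    return latency + (n - 1)

print(cycles_non_pipelined(8))  # 32 cycles
print(cycles_pipelined(8))      # 11 cycles
```

For large n, the pipelined unit approaches a throughput of one instruction per cycle, even though every individual instruction still takes four cycles.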

This means that, given sufficient work for a single SIMD, we can execute 16 FMA instructions per clock. If we could issue a different instruction every cycle, though, we'd have to deal with another problem -- some of the results may not be ready. Imagine our architecture had a four-cycle latency on all instructions and a single-cycle dispatch (meaning every cycle we can throw a new instruction into the pipeline.) Now, we want to execute this imaginary code:

v_add_f32 r0, r1, r2
v_mul_f32 r3, r0, r2

We have a dependency between the first and the second instruction -- the second instruction cannot start before the first has finished, as it needs to wait for r0 to be ready. This means we'd have to stall for three cycles before we issue another instruction. The GCN architects solved this by issuing one instruction to a SIMD every four cycles. Additionally, instead of executing one operation on 16 elements and then switching to the next instruction, GCN runs the same instruction four times on a total of 64 elements. The only real change this requires to the hardware is to make the registers wider than the SIMD unit. Now, you never have to wait, as by the time the v_mul_f32 starts on the first 16 elements, the v_add_f32 just finished them:

Instruction issue for a single SIMD on GCN.
Instruction issue on a single GCN SIMD. One instruction is issued every four clock cycles, and instructions can start immediately one after the other.

You'll immediately notice those wait cycles, and clearly it's no good to have a unit spend most of its time waiting. To fill them up, the GCN designers use four SIMD units, so the real picture on the hardware looks as follows:

Instruction issue for a single CU on GCN.
By duplicating the SIMD four times, one instruction can be issued every clock cycle, bringing the total throughput to 64 elements/clock (4 SIMDs à 16 elements/SIMD/clock).

This 64-wide construct is called a "wavefront" or "wave", and it's the smallest unit of execution. A wave can get scheduled onto a SIMD to execute, and each thread group consists of at least one wave.
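Since a wave is the smallest unit of execution, a thread group always occupies a whole number of waves -- a quick Python sketch, assuming the GCN wave size of 64:

```python
import math

WAVE_SIZE = 64  # GCN wavefront width

def waves_per_thread_group(group_size):
    # A partially filled wave still occupies a full wave slot.
    return max(1, math.ceil(group_size / WAVE_SIZE))

print(waves_per_thread_group(64))   # 1 wave, fully occupied
print(waves_per_thread_group(100))  # 2 waves, 28 lanes idle in the second
```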

Scalar code

Phew, that was quite a bit, and we're unfortunately still not done. So far, we pretended everything is executed on the SIMD, but remember when I wrote there are two blocks related to execution? High time we get to this other ... thing.

If you program anything remotely complex, you'll notice that there are really two kinds of variables: Uniform values, which are constant across all elements, and non-uniform values, which differ per lane.

A non-uniform variable would be for instance the laneId. We'll pretend there's a special register we can read -- g_laneId, and then we want to execute the following code:

if (laneId & 1) {
    result += 2;
} else {
    result *= 2;
}
In this example, we're not going to talk about conditional moves, so this needs to get compiled to a branch. How would this look on a GPU? As we learned, there is something called the execution mask which controls which lanes are active, and which enables divergent control flow. With that knowledge, this code would probably compile to something like this:

v_and_u32 r0, g_laneId, 1   ; r0 = laneId & 1
v_cmp_eq_u32 exec, r0, 1    ; exec[lane] = r0[lane] == 1
v_add_f32 r1, r1, 2         ; r1 += 2
v_invert exec               ; exec[i] = !exec[i]
v_mul_f32 r1, r1, 2         ; r1 *= 2
v_reset exec                ; exec[i] = 1

The exact instructions don't matter; what matters is that all the values we're looking at that are not literals are per-lane values. That is, g_laneId -- like r1 -- has a different value for every single lane. This is a "non-uniform" value, and the default case, as each lane has its own slot in a vector register.
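To make the masking concrete, here's a small Python model of the sequence above -- it models only the exec-mask semantics, not actual GCN behavior, and assumes r1 starts at 10.0 in every lane:

```python
LANES = 64

lane_id = list(range(LANES))
r1 = [10.0] * LANES  # assumed initial per-lane value of r1

# v_cmp_eq_u32: activate the lanes where (laneId & 1) == 1
exec_mask = [(l & 1) == 1 for l in lane_id]
# v_add_f32: only active lanes write their result
r1 = [v + 2 if m else v for v, m in zip(r1, exec_mask)]
# v_invert: flip the mask for the else branch
exec_mask = [not m for m in exec_mask]
# v_mul_f32: only the (now active) even lanes write
r1 = [v * 2 if m else v for v, m in zip(r1, exec_mask)]

print(r1[0], r1[1])  # 20.0 12.0
```

Both branches executed on the whole wave; the mask just decided which lanes kept their results.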

Now, suppose the control flow looked like this instead, with cb coming from a constant buffer:

if (cb == 1) {
    result += 2;
} else {
    result *= 2;
}

We could turn it into this straightforward code:

v_cmp_eq_u32 exec, cb, 1    ; exec[i] = cb == 1
v_add_f32 r1, r1, 2         ; r1 += 2
v_invert exec               ; exec[i] = !exec[i]
v_mul_f32 r1, r1, 2         ; r1 *= 2
v_reset exec                ; exec[i] = 1

This suddenly has a problem the previous code didn't have -- cb is constant for all lanes, yet we pretend it's not. As cb is a uniform value, the comparison against 1 could be computed once instead of per lane. That's how you'd do it on a CPU, where vector instructions are a comparatively recent addition: you'd perform a normal conditional jump (again, ignore conditional moves for now), and call a vector instruction in each branch. Turns out, GCN has the same concept of "non-vector" execution, aptly named "scalar" as it operates on a single scalar instead of a vector. In GCN assembly, the code could compile to:

s_cmp_eq_u32 cb, 1          ; scc = (cb == 1)
s_cbranch_scc0 else         ; jump to else if scc == 0
v_add_f32 r1, r1, 2         ; r1 += 2
s_branch end                ; jump to end
else:
v_mul_f32 r1, r1, 2         ; r1 *= 2
end:

What does this buy us? The big advantage is that those scalar units and registers are super cheap compared to vector units. Whereas a vector register is 64×32 bits in size, a scalar register is just 32 bits, so we can put many more scalar registers on the chip than vector registers (some hardware has special predicate registers for the same reason -- one bit per lane is much less storage than a full-blown vector register.) We can also add exotic bit-manipulation instructions to the scalar unit, as we don't have to instantiate it 64 times per CU. Finally, we use less power, as the scalar unit has less data to move and work on.
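The storage difference is easy to quantify -- a quick back-of-the-envelope calculation in Python:

```python
LANES = 64

vgpr_bits = LANES * 32  # one vector register: 32 bits per lane
sgpr_bits = 32          # one scalar register: a single 32-bit value

# One vector register costs as much storage as 64 scalar registers.
print(vgpr_bits // sgpr_bits)  # 64
```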

Putting it all together

Now, being hardware experts, let's see how we can put our knowledge to good use finally. We'll start with wavefront wide instructions, which are a hot topic for GPU programmers. Wavefront wide means we do something for all lanes, instead of doing it per lane -- what could that possibly be?

Scalar optimizations

The first thing we might want to try is to play around with that execution mask. Every piece of hardware has this in some form, be it explicit or as a predication mask. With that, we can do some neat optimizations. Let's assume we have the following code:

if (distanceToCamera < 10) {
    return sampleAllTerrainLayers ();
} else {
    return samplePreblendedTerrain ();
}
Looks innocent enough, but both function calls sample memory and are thus rather expensive. As we learned, if we have divergent control flow, the GPU will execute both paths. Even worse, the compiler will likely compile this to the following pseudo-code:

VectorRegisters<0..32> allLayerCache = loadAllTerrainLayers();
VectorRegisters<32..40> simpleCache = loadPreblendedTerrainLayers();
if (distanceToCamera < 10) {
    blend (allLayerCache);
} else {
    blend (simpleCache);
}
Basically, it will try to front-load the memory access as much as possible, so by the time we reach the else path, there's a decent chance the loads have finished. However, we -- as developers -- know that the all-layer variant is higher quality, so how about this approach: If any lane goes down the high-quality path, we send all lanes down the high-quality path. We'll have slightly higher quality overall, and on top of that, we get two optimizations in return:

  • The compiler can use fewer registers by not front-loading both
  • The compiler can use scalar branches

There's a bunch of functions for this, all of which operate on the execution mask (or predicate registers, from here on I'll pretend it's the execution mask.) The three functions you'll hear of are:

  • ballot () -- returns the exec mask
  • any () -- returns exec != 0
  • all() -- returns ~exec == 0
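A small Python model of those three functions, treating the exec mask as an integer with one bit per lane -- a sketch of the semantics, not any particular shading-language API:

```python
LANES = 64

def ballot(predicates):
    # Pack the per-lane booleans into a mask; lane 0 maps to bit 0.
    mask = 0
    for lane, p in enumerate(predicates):
        if p:
            mask |= 1 << lane
    return mask

def any_lane(predicates):
    # True if at least one lane's predicate is set.
    return ballot(predicates) != 0

def all_lanes(predicates):
    # True if every lane's predicate is set.
    return ballot(predicates) == (1 << LANES) - 1

preds = [lane % 2 == 0 for lane in range(LANES)]
print(any_lane(preds), all_lanes(preds))  # True False
```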

Changing our code to take advantage of this is trivial:

if (any (distanceToCamera < 10)) {
    return sampleAllTerrainLayers ();
} else {
    return samplePreblendedTerrain ();
}
Another common optimization applies to atomics. If we want to increment a global atomic by one per lane, we could do it like this:

atomic<int> globalAtomic;
if (someDynamicCondition) {
    ++globalAtomic;
}
This will require up to 64 atomic increments on GCN (as GCN does not coalesce them across a full wavefront.) That's quite expensive; we can do much better by translating it into:

atomic<int> globalAtomic;
var ballotResult = ballot (someDynamicCondition);
if (laneId == 0) {
    globalAtomic += popcount (ballotResult);
}
Where popcount counts the number of set bits. This cuts down the number of atomics by a factor of up to 64. In reality, you probably still want a per-lane value if you're doing a compaction, and it turns out that case is so common that GCN has a separate opcode for it (v_mbcnt), which the compiler uses automatically when doing atomics.
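A Python sketch verifying that the two approaches agree -- 64 per-lane increments versus a single popcount(ballot) add from one lane (someDynamicCondition is stubbed out with an arbitrary predicate):

```python
def ballot(predicates):
    # Pack the per-lane booleans into a mask; lane 0 maps to bit 0.
    mask = 0
    for lane, p in enumerate(predicates):
        if p:
            mask |= 1 << lane
    return mask

# Stand-in for someDynamicCondition, evaluated per lane.
preds = [lane % 3 == 0 for lane in range(64)]

# Naive version: one atomic increment per active lane.
naive_total = sum(1 for p in preds if p)

# Optimized version: a single atomic add of popcount(ballot()).
aggregated_total = bin(ballot(preds)).count("1")

print(naive_total, aggregated_total)  # 22 22
```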

Finally, one more on the scalar unit. Let's assume we have a vertex shader which passes through some drawId, and the pixel shader gets it as a normal interpolant. In this case (barring cross-stage optimization and vertex-input-layout optimization), code like this will cause problems:

var materialProperties = materials [drawId];

As the compiler does not know that drawId is uniform, it will assume it can be non-uniform, and thus perform a vector load into vector registers. If we do know it's uniform -- dynamically uniform is the specific term here -- we can tell the compiler about it. GCN has a special instruction for this which has somewhat become the "standard" way to express it -- v_readfirstlane. Read-first-lane reads the first active lane and broadcasts its value to all other lanes. In an architecture with separate scalar registers, this means the value can be loaded into a scalar register. The optimal code would thus be:

var materialProperties = materials [readFirstLane (drawId)];

Now, the materialProperties are stored in scalar registers. This reduces the vector register pressure, and will also turn branches that reference the properties into scalar branches.
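A minimal Python model of the read-first-lane semantics. Note that it reads the first *active* lane, which matters under divergent control flow:

```python
def read_first_lane(values, exec_mask):
    # Broadcast the value of the first active lane; on GCN, this value
    # would end up in a scalar register, uniform across the wave.
    for value, active in zip(values, exec_mask):
        if active:
            return value
    raise ValueError("no active lanes")

draw_id = [7] * 64  # dynamically uniform: every lane holds the same value
print(read_first_lane(draw_id, [True] * 64))            # 7
print(read_first_lane([1, 2, 3], [False, True, True]))  # 2 -- lane 0 is inactive
```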

Vector fun

So much for the scalar unit; let's turn to the vector unit, because things get really interesting here. Turns out, pixel shaders have a huge impact on compute, because they force the hardware to do something really funky -- make lanes talk to each other. Everything we learned so far about GPUs said you can't talk across lanes, except through LDS, by broadcasting something through scalar registers, or by reading a single lane. But pixel shaders have a very unique requirement -- they need derivatives. GPUs implement derivatives using quads, i.e. 2×2 pixels, and exchange data between them in a dynamically varying way. Mind blown?

Pixel shaders access neighboring lanes during ddx, ddy instructions.
Pixel shaders can access neighboring lanes through ddx(), ddy() instructions. Each lane processes one pixel, and within 4 lanes, a lot of exchange is required to make derivatives work. On the right side, we can see the initial packing, and how the derivatives swap data between lanes within four lanes.

This is commonly referred to as quad swizzle, and you'll be hard pressed to find a GPU which doesn't do this. However, most GPUs go much further and provide more than just simple swizzling across four lanes. GCN takes it really far since the introduction of DPP -- data parallel primitives. DPP goes beyond swizzling and provides cross-lane operand sourcing. Instead of just permuting lanes within a quad, it actually allows one lane to use another lane's value as the input for an instruction, so you can express something like this:

v_add_f32 r1, r0, r0 row_shr:1

What does it do? It takes the current value of r0 on this lane and on the lane to the right on the same SIMD (row-shift-right being set to one), adds them together, and stores the result in the current lane. This is seriously powerful functionality, but it introduces new wait states and has various limitations as to which lanes can exchange data. All of this requires intimate knowledge of the implementation, and as vendors differ in how they exchange data between lanes, the high-level languages expose general wave-wide reductions like min, which will use swizzles or primitives like DPP to get to one value. With those, you can reduce values across a wavefront in very few steps, without touching memory -- it's fast and still easy to use; what's not to like!
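To see why so few steps suffice: a wave-wide reduction needs only log2(64) = 6 rounds of lane exchanges. Here's a Python sketch of a butterfly min-reduction, where lanes[i ^ d] stands in for whatever cross-lane read the hardware provides (swizzle, DPP, ...):

```python
def wave_reduce_min(values):
    # Butterfly reduction: in each round, every lane combines its value
    # with the lane at XOR-distance d. After log2(width) rounds, every
    # lane holds the wave-wide minimum.
    lanes = list(values)
    d = len(lanes) // 2
    while d >= 1:
        lanes = [min(lanes[i], lanes[i ^ d]) for i in range(len(lanes))]
        d //= 2
    return lanes[0]

data = [(i * 37) % 64 for i in range(64)]  # arbitrary test values
print(wave_reduce_min(data) == min(data))  # True
```

Six rounds instead of 64 sequential compares -- and no trip through local memory.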


I hope I could shed some light on how things really work this time around. There's not much left that hasn't been covered so far; for GCN, I can only think of separate execution ports and how waits are handled, so maybe we'll cover that in a later post. Other than that, thanks for reading, and if you have questions, don't hesitate to get in touch with me!

Introduction to compute shaders

A couple of months ago I went to the Munich GameCamp -- a bar camp where anyone can propose a talk, and a quick vote decides which talks get accepted. I had been asking around for ideas, and one developer mentioned I might want to cover compute shaders. So I went in without much hope of attracting many people, but I ended up with an overcrowded room -- roughly one quarter of all the game camp participants -- rambling about compute shaders for roughly an hour. Afterwards, the main question I got was: "Where can I read about this?", and I couldn't quite name a good introduction to compute shaders (there's Ryg's "A trip through the graphics pipeline", but that's already well past an introduction.)

The hardware

To understand how compute shaders happened, we have to take a look at the hardware evolution. Back in the old days, before shaders, geometry processing and texturing were separate. For instance, a Voodoo² card had one rasterizer and two texturing units on the card, splitting the workload between them. This theme continued for a long time, even after shaders were introduced. Up until the GeForce 7 and the Radeon X1950, GPUs had separate vertex and pixel shader units. Those units usually had similar capabilities in terms of what they could compute (after all, additions and multiplications are the bulk of the work on a GPU), but memory access differed a lot -- for instance, accessing textures was something vertex shaders couldn't do for a long time. At that time, the split made sense, as scenes consisted of few polygons covering many pixels, so having less vertex shading power typically didn't result in a bottleneck. By removing functionality from the vertex shaders, they could be better optimized and thus run faster.

A basic vertex & pixel pipeline.
A basic GPU pipeline with separate vertex and pixel units. Vertex units write shaded vertices into an attribute buffer, the pixel units consume it and also access memory through texture units.

However, a fixed distribution also makes it impossible to load-balance resources. As games evolved, sometimes more vertex power was required -- for instance, when rendering dense geometry like trees -- while other games exploited the new shader programming models to write more complex pixel shaders. This was also the time at which GPGPU programming started, that is, programmers were trying to solve numerical problems on GPUs. Eventually, it became clear that the fixed split was not good enough.

In late 2006 -- early 2007, the age of "unified shaders" began with the release of the GeForce 8800 GTX and the Radeon HD 2900 (technically speaking, the Xbox 360 was first, with a customized ATI chip.) Gone were the days of separate units; instead, the shader core could process any kind of workload.

A basic, unified vertex & pixel pipeline.
An unified pixel & vertex pipeline. The units can all talk to memory through the texture units, and pass data between them through an attribute buffer.

So what happened back then? The ALUs -- the units executing the math instructions -- were already similar. What changed was that the distinction between the units was removed completely. Vertex shaders were now executing on the same units as pixel shaders, making it possible to balance the workload between vertex and pixel shader work. One thing you'll notice here is that vertex and pixel shaders need to communicate somehow -- a decent chunk of memory is needed to store attributes for interpolation, like vertex color. A vertex shader will compute those per vertex and move on to the next vertex, but a pixel shader might have to hang on to this information for a while until all pixels are processed. We'll come back to this memory later; let's just remember it for now.

A basic compute pipeline.
The unified pipeline, shown as a compute pipeline. The attribute buffer becomes the local or shared memory, the texture units become the gateway to global memory, and the pixel and vertex units are now general compute units.

This model -- some ALUs bundled together, with some extra memory to allow communicating between sets of them -- was exposed at the same time as a "compute shader". In reality there are a few more details to this, for instance, a compute unit usually does not process a single element, but multiple of those, and there are quite a few more caches involved to make all of this efficient. Below is the diagram for a single compute unit in the AMD GCN architecture -- note that what I previously called "compute unit" is now a "SIMD", more on this in a moment.

An advanced compute pipeline.
An AMD GCN class compute unit, consisting of four 16-wide SIMD units, local memory, and various caches. Blocks with a dashed outline are shared between more than one compute unit.

Note that other GPUs look very similar; I'm simply using GCN here as I'm most familiar with it. On NVIDIA hardware, a GCN compute unit maps to a "streaming multiprocessor", or SM. The SIMDs have different widths, and the caches look a bit different, but the basic compute model is still exactly the same.

Back to the SIMD unit -- GPUs have always been optimized to process many things concurrently. Where you have one pixel, you have many, and all of them run the same shader, so this is how the hardware was designed. In the case of GCN, the designers built four 16-wide SIMDs into each compute unit. A SIMD (short for single instruction, multiple data) unit can execute one operation across 16 elements at once. Not in one clock cycle -- there's some latency -- but for GCN, that latency is four cycles, so by having four SIMDs and pretending each is not 16 but 64 wide, the machine behaves as if it executes one instruction across 64 elements every cycle. Confused? Don't worry, this is just an implementation detail; the point is that every GPU executes some set of instructions together, be it 64 in the GCN case or 32 on current NVIDIA architectures.

What we ended up with is a GPU consisting of many compute units, each of which contains:

  • A bunch of SIMD units, each executing one instruction across many items.
  • Some local memory, which can be used to communicate between shader stages.

With those prerequisites, let's have a look at how we might expose them when not doing pixel and vertex shader work. Let's get going!

The kernel

The goal is to write a programming system which allows us to take advantage of the hardware at hand. Obviously, the first thing we notice is that cross-compute-unit communication is something we want to avoid. All of the compute units are clients of the L2 cache, but clearly the L1 caches can go out of sync, so if we distribute work, we should pretend we can't talk across compute units. This means we somehow have to split our work into smaller units. We could just pretend there are no compute units and dispatch individual items, but then we'd miss out on the local memory, so it seems we want one level below "here's all the work".

We could go directly to the SIMD level, but that's not optimal, as there is a capability within a compute unit we haven't discussed yet. As we have some kind of local memory, it's natural to assume we can use it to synchronize the SIMD units (we could just write something to local memory and wait for it to show up). This also means we can only synchronize within a single compute unit, so let's use that as the next grouping level -- we'll call the work that gets put onto a single compute unit a work group. We thus start with a work domain, which is then split into work groups. Each work group runs independently of the others. Given the developer might not know how many groups can run concurrently, we enforce that work groups can't rely on:

  • Another work group from the same domain running concurrently
  • Work groups within a domain executing in any particular order

That allows us to run one work group after the other on a single unit, or all of them at once. Now, inside a work group, we still need many independent items of work, so we can fill all of our SIMD units. Let's call the individual item of work a work item, and our programming model will launch work groups containing many work items -- at least enough to fill up one compute unit (we probably want more than just enough to fill it, but more on this later.)

Let's put all of this together: Our domain consists of work groups, each work group gets mapped onto one compute unit, and the SIMD units process the work items.
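In code, carving a domain into work groups is just a rounded-up division. A quick Python sketch for the 1D case (the work group size of 256 below is an arbitrary example):

```python
import math

def num_work_groups(domain_size, group_size):
    # Every work item must be covered, so we round up; the last group
    # may be only partially filled.
    return math.ceil(domain_size / group_size)

print(num_work_groups(1_000_000, 256))  # 3907
```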

A basic programming model
Here is how our programming model maps onto the hardware so far. A domain consists of work groups, those get assigned to compute units. Each workgroup is comprised of multiple work items.

One thing we forgot to mention so far is the fact that there's apparently one more level which we didn't cover. I mentioned a SIMD unit executes multiple work items together, yet we haven't exposed this in our programming model, we only have "independent" work items there. Let's give the work items executing together a name -- we'll call them a subgroup (on AMD, they're typically referred to as a "wavefront" or "wave", while NVIDIA calls them "warp".) Now, we can expose another set of instructions, which perform operations across all items in a SIMD. For instance, we could compute the minimum of all values without having to go through local memory.

I mentioned before that there are only very few constraints on ordering. For instance, a GPU may want to execute a whole domain on a single compute unit. One of the reasons for doing this is memory latency. Those wide SIMD units are great when it comes to number crunching, but it also means that memory access is, relatively speaking, super slow. In fact, it's horribly slow: a GCN CU can process 64 float multiply-add instructions per cycle, which is 64×3×4 bytes of input data and 64×4 bytes of output. Across a large chip like a Vega10, that's 48 KiB worth of data read in a single cycle -- at 1.5 GHz, that's 67 TiB of data per second you'd have to read in. GPU memory systems have thus been optimized for throughput, at the expense of latency, and that impacts how we program them.
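Those bandwidth numbers are easy to re-derive -- a few lines of Python, assuming 64 compute units for Vega10:

```python
FMAS_PER_CU_PER_CYCLE = 64  # one GCN CU: 4 SIMDs x 16 lanes
NUM_CUS = 64                # Vega10
BYTES_PER_FMA_IN = 3 * 4    # three 32-bit input operands per FMA

bytes_read_per_cycle = FMAS_PER_CU_PER_CYCLE * NUM_CUS * BYTES_PER_FMA_IN
print(bytes_read_per_cycle // 1024)  # 48 KiB per cycle

CLOCK_HZ = 1.5e9
tib_per_second = bytes_read_per_cycle * CLOCK_HZ / 2**40
print(round(tib_per_second))         # 67 TiB/s
```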

Before we head into the effects of this on the programming model, let's summarize what compute shaders look like:

  • Work is defined through a 3-level hierarchy:
    • A domain specifies all work
    • The domain is subdivided into work groups, which are executed independently, but allow for communication inside a group
    • Work items are the individual elements of work
  • Some APIs also expose an intermediate level -- the subgroup, which allows some optimizations below the work group, but above the work item level.
  • Work groups can synchronize internally and exchange data through local memory.

This programming model is universal across all GPU compute shader APIs. Differences occur in terms of what guarantees are provided, for instance, an API might limit the number of work groups in flight to allow you to synchronize between all of them; or some hardware might guarantee the execution order so you can pass information from one workgroup to the next.

Latency hiding & how to write code

Now that we understand how the hardware and the programming model look, how do we actually program this efficiently? Until now, it sounds like we're writing normal code, and all that happens is that we process a work item in it, potentially communicating with our neighbors on the same SIMD, with others through local memory, and finally reading and writing through global memory. The problem is that "normal" code is not good enough, as we need to hide latency.

The way a GPU hides latency is by having more work in flight. A lot more work -- taking GCN again as the example, every compute unit can have up to 40 subgroups in flight. Every time a subgroup accesses memory, another one gets scheduled in. Given only four can execute at the same time, this means we can switch up to 10 times before we return to the subgroup which started the original request.

Latency hiding
The first subgroup hits a memory instruction and stalls. The compute unit immediately schedules the next subgroup, which also stalls. After the fourth subgroup has issued an instruction, the memory subsystem has returned, and the first subgroup can resume processing. This way, the compute unit was executing work at every cycle.

There's a problem here, though -- subgroup switching needs to be instantaneous to make this possible, which means you can't write the program state to memory and read it back. Instead, all state of all programs is kept permanently resident in registers. This requires a huge amount of registers -- a single compute unit on GCN has 256 KiB of registers. With that, we can use up to 256 KiB / 40 waves / 64 lanes / 4 bytes ≈ 25 registers for a single item (24 in practice, due to allocation granularity) before we need to reduce occupancy. For our programming style, this means we should try to minimize the amount of state we need to keep around as much as possible, at least as long as we're accessing memory. If we don't access memory, a single wave can keep a SIMD 100% occupied. We should also make sure to use the local memory and L1 cache as much as we can, as they have more bandwidth and less latency than the external memory.
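The register budget, spelled out in Python -- the division yields roughly 25 registers per item; allocation granularity brings the practical limit down to 24:

```python
REGISTER_FILE_BYTES = 256 * 1024  # vector registers per GCN compute unit
MAX_WAVES = 40                    # maximum subgroups in flight per CU
LANES = 64
BYTES_PER_REGISTER = 4            # one 32-bit value per lane

registers_per_item = REGISTER_FILE_BYTES // (MAX_WAVES * LANES * BYTES_PER_REGISTER)
print(registers_per_item)  # 25
```

Use more registers per item, and fewer waves fit on the compute unit -- less latency hiding, lower occupancy.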

Execution on SIMD
One SIMD can process multiple work items in a single clock cycle, in this figure, it's 8-wide. If a branch is short, we don't waste much computation. Long branches, which disable most of the SIMD lanes however can drastically reduce our throughput.

The other problem we're going to face is how branches are handled. SIMDs execute everything together, so they can't perform jumps. What happens instead is that the individual lanes get masked off, i.e. they no longer participate. It also implies that unless all work items take the same branch, a GPU will typically execute both sides of the branch.

Which means that for a SIMD of width N, we can get a worst-case utilization of 1/N. On CPUs, N is usually 1..8, so it's not catastrophic, but on a GPU, N can be as high as 64, at which point this really matters. How can we make sure this doesn't happen? First, we can take advantage of the subgroup execution: if we have a branch selecting between a cheap and an expensive version, and some work items take the expensive version, we might as well all take it. This reduces the cost of the branch from expensive + cheap to just expensive. The other part is to simply avoid excessive branching. As CPUs get wider, this is also becoming more important there, and techniques like sorting all data first and then processing it become more interesting than heavy branching on individual work items. For instance, if you're writing a particle simulation engine, it's much faster to sort the particles by type and then run a specialized program for each type, compared to running all possible simulation programs for each particle.
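A toy Python version of the sort-then-process idea -- the particle types and payloads here are made up for illustration:

```python
from itertools import groupby

# Hypothetical particles as (type, payload) pairs, in arrival order.
particles = [("smoke", 1), ("spark", 2), ("smoke", 3), ("rain", 4), ("spark", 5)]

# Sort by type first, so each specialized routine processes a contiguous,
# divergence-free batch instead of branching per particle.
particles.sort(key=lambda p: p[0])
batches = {
    ptype: [payload for _, payload in group]
    for ptype, group in groupby(particles, key=lambda p: p[0])
}

print(batches["smoke"], batches["spark"])  # [1, 3] [2, 5]
```

Each batch can now run a specialized program where every lane takes the same path.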

What have we learned? We need:

  • A large problem domain -- the more independent work items, the better.
  • Memory intensive code should minimize the state, to allow for high occupancy.
  • We should avoid branches which disable most of the subgroup.

Those guidelines will apply universally to all GPUs. After all, all of them were designed to solve massively parallel graphics problems, and those have little branching, not too much state in flight, and tons and tons of independent work items.


I hope this blog post gives you a good idea how we ended up with those compute shaders. The really interesting bit is that the concepts of minimal communication, local memory access, and divergence costs are universally applicable. Modern multi-core CPUs are also introducing new levels of communications costs, NUMA has been long used to model costs for memory access, and so on. Understanding that not all memory is equal, and that your code is executed in some particular way on the underlying hardware will enable you to squeeze out more performance everywhere!

Project status

Like many other developers, I have written many projects over the years, and because I believe in open source, I released most of them publicly. As I focused on scratching my own itch, nearly all of my projects are not very popular, and some of them are mostly dead these days. However, if you went to my project overview, you wouldn't be able to tell what state each project is in.

This weekend, I sat down to solve this once and for all, by creating an inventory of my projects, categorizing the status, and getting this all on my website in an easy-to-digest fashion. Sounds straightforward, but there are a few details which are hopefully of interest: Maintenance work, inventory management, and figuring out what states I want to actually report. Let's dive in!


Going through my various projects I noticed that my most successful projects in terms of users and community contribution were in a rather sad state. No easy-to-find readme, no continuous integration, and even worse, no changelog. Those are the three main things I look for in other projects, and I failed my own minimum bar here.

Solving the changelog is rather simple, but the continuous integration and the readme required a bit more work. You might wonder: how can a readme require extra work? Before we get to that, a bit of context first: My most popular projects are pure Python projects. This means there's exactly one way to publish them, which is PyPI. PyPI uses setuptools for distribution, and there have been quite a few changes since I originally uploaded my projects.

These days, uploading to PyPI requires twine. Additionally, as PyPI moved to warehouse, upload paths and more needed to be adjusted. Fortunately, the documentation for package deployment was also significantly improved. Some of this may sound like busywork, but it's actually worth spending the time. For me, it meant I could include a nicely formatted readme as the long project description, for instance -- at the cost of some extra markup.
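To give an idea of what that involves, here's a hypothetical setup.py fragment that embeds the readme as the long description on PyPI. The project name and file names are made up, and you should check the current Python packaging documentation for the authoritative fields:

```python
# Hypothetical packaging fragment -- the project name and files are made up;
# consult the current packaging guide before relying on the exact fields
from setuptools import setup

with open ('README.md', encoding='utf-8') as readme:
    longDescription = readme.read ()

setup (
    name = 'my-package',
    version = '1.0.0',
    description = 'One-line summary shown in search results',
    # The readme becomes the long description rendered on the project page
    long_description = longDescription,
    long_description_content_type = 'text/markdown')
```

This is a configuration sketch, not a complete setup.py -- packages, dependencies and so on are omitted.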

I also used the opportunity to get some continuous integration in place. I'm not a big fan of the git monoculture, so I use Mercurial as my primary version control system, and hence Bitbucket. Unfortunately, popular services like Travis CI don't support Bitbucket, so continuous integration was usually a pain point. Turns out, Bitbucket solved the problem themselves by introducing Bitbucket Pipelines, which is a very simple way to run a project inside a Docker container. With some updates to my tox scripts, I got this running for my projects in no time.

With this out of the way, it was time to figure out what projects I had floating around besides the few projects I updated: Time for some archeology!

Inventory management

The first step is getting a proper inventory of all projects. As usual, I have a lot of choice as to what is my "source of truth". Is it Bitbucket, where nearly all of my projects live? Is it PyPI? Do I store it inside each project or do I use some other place?

I was tinkering around with just using Bitbucket. Turns out, nearly all my projects are in fact on Bitbucket, so stuffing a meta.json in there seemed like a good choice. I even have a mirror script for Bitbucket which I use to back up my repositories locally, so that seemed like the natural choice … except for my closed source projects. To get that to work, I would have to index my local repository store which contains both my public and private repositories. Given I have roughly a dozen projects I care about, this started to smell like some serious overkill, but I'd probably go for this if I had more projects floating around (or if I had more time to automate some things.)

Turns out, I do have a decent index already -- the project descriptions I'm using to generate my website in the first place. Liara -- the backend for the website -- uses a YAML block at the beginning of each page to store arbitrary metadata, and that's now the new canonical index. Problem solved, but now comes the next problem: What do we want to store there?
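For illustration, here's a minimal sketch of how such a metadata block can be split off a page source. The field names are made up, and Liara's actual front-matter handling may well differ:

```python
# Hypothetical page source -- the field names are made up for illustration
page = '''---
title: SJSON
status: maintenance
---
Page content starts here.
'''

# Minimal front-matter split: the metadata sits between the two --- markers
_, frontMatter, content = page.split ('---', 2)

# Parse simple ``key: value`` lines; a real site would use a YAML parser
metadata = {}
for line in frontMatter.strip ().splitlines ():
    key, _, value = line.partition (':')
    metadata [key.strip ()] = value.strip ()

print (metadata)
```

A real implementation would hand the block to a YAML parser instead of splitting lines by hand, but the idea is the same.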

Project states

The key bit of information is what status a project can be in. There are some obvious candidates like "active" and "abandoned", but there are a few more states which I think are worth discerning. Looking through all my projects, these are the states I settled on:

  • Active: I'm working on this project, which means I'm doing some work at least once a quarter. Plus, if someone reports an issue or requests a feature, there's a decent chance it will actually happen.
  • Maintenance: The project is good enough, all features I wanted to write are in there, and there's really no reason to keep working on it. However, if someone would ask for a new feature or a bug fix, I'll most likely get to it in a timely manner.
  • Archived: Similar to maintenance, but I'm not going to fix bugs or add new features even if requested. Those are projects which are online for reference purposes, hopefully useful but not popular enough to warrant spending time on them.
  • Stalled: That's a tricky one, as it's kind of similar to the next state -- abandoned. The main difference here is that a stalled project solves some problem, but it's probably not generally useful. An archived project is completely done, I achieved everything I wanted there. For a stalled project, I have some things in mind for how to continue, but no good use case to make it happen or justify additional effort.
  • Abandoned: That's the case everyone is familiar with. I started hacking, but due to lack of motivation or users, I stopped before it was useful. That's the main difference to stalled -- stalled does work for some limited use cases, abandoned probably never got beyond proof-of-concept.
  • Deprecated: This is a special state for projects which are no longer relevant. I only have a few projects in this state, where the upstream project moved along and hence there's no reason any more to use my library. It's similar to archived, but due to factors outside of my control, it's no longer useful.

On top of that, I also added a quality tag for production-ready projects. Some libraries like SJSON are extensively tested (more than 90% code coverage), have been used by other developers, and are also well-documented. Sticking the maintenance tag onto those doesn't seem to do them justice, so if you happen to look at the details of one of those projects, the status will mention it's production ready.


Besides a much friendlier overview page, I also got something I was missing a lot -- peace of mind. Previously, any project which was in "limbo" was a constant mental weight, as I couldn't put it to rest properly. Now I have a simple way to clearly mark it as done and I can finally stop thinking about it without having a bad conscience about it. Time to move on to new projects 😊!

Save games vs. hard mode

Today, something completely different from the usual programming stuff I've been blogging about recently. It's time for some gaming again! Or more specifically, time to discuss one of the major problems I have with a bunch of games -- the save game handling!

What's the deal?

I'm a highly irregular gamer these days. The times when I could sit down for a couple of hours in a row to practice a game are mostly gone. Instead, I often have half an hour or an hour I can spend on a game, and I want to make some progress there. Which is usually fine, as many games these days allow you to consume the content in small chunks with a decent save game system. By decent, I mean the save game system implements (hopefully all of) the following features:

  • Allow me to save the game at any time
  • Allow me to continue from whatever the last save was with one click
  • Support more than one save game per session

This is really the base feature set I expect. The ability to save at any time means I can just turn off the machine, instead of having to continue playing until the next checkpoint and/or replay parts of the game next time I get to play. I get really upset when my OS doesn't allow me to shut down my machine because it's "installing updates"; I don't need games to mimic this behavior. Continuing from the last save point with one click is a similar convenience feature. Both together allow me to enjoy the game and don't force me to replay parts of it.

More than one save game per session is an interesting one. I tend to save once in a while, like every time I finish a session. It's very uncommon that I return to them, but I still like keeping save games at interesting places around. The primary use case is probably to go back and capture some screenshot, or to try another approach I learn about later on. For instance, in "The Witcher 2", the second chapter is drastically different depending on a choice; I do want to finish the story on one path, then just play the second chapter with the different choice to see the difference. Another great example of how to do this right is "Papers, Please", which automatically saves on every single day, and shows you the branches in the save game dialog.

It's also helpful when playing a new game, as sometimes you can dig yourself into a hole you can't get out of, and then returning to a savegame from an hour ago may be the best option instead of starting the whole session from scratch. Not to mention trying out something -- what happens when I press this red button? Oh, I see, that wasn't good; glad I don't have to replay the last two hours!

Hard mode!

The problem I've been facing recently in a couple of games is "one save only". The usual reasoning is "hard mode", or something along those lines. While I get the intention, I hope this post helps to explain the position of people like me who will not buy as a result. The non-existence of a good save game system reduces the appeal of a game for me -- and there's no upside for me. If you (designing the game) really like the "hard mode", or it's a roguelike, or whatever, just add an option to play it the "proper way", but don't force it upon everyone. I can't believe that's going to impact sales, as everyone can still play with just one save game if they want, for maximum immersion (see X-Com with the notorious Iron Man mode).

I might really enjoy your game, but I simply don't have the time to replay two hours just to get to the point where I failed. It's not like a game can really prevent me from consuming it my way: I can usually hack around those save game systems with some file monitoring and by exiting/restarting the game. Why make it so difficult?

Here's my plea to all game developers out there: Don't remove standard save game options while designing your game. For customers like me, this is a buy/no-buy decision, because I don't have the time to replay your game over and over just because the save game system is designed for "hard mode" or roguelikes. I'm perfectly fine spending my money to see the carefully crafted content you prepared for me just once, and it's my decision to make if there's enough content to warrant the price. Making me practice the game or forcing me to replay parts over and over is not helping my buy decision. Thanks!

Codegen for fast Vulkan

If you're using Vulkan, you might have come across this document explaining how to get the best calling performance. The gist is that instead of using the entry points provided by the loader, you should query the real entry points using vkGetDeviceProcAddr and use those instead. This can yield significant performance gains when CPU limited, as it avoids an indirection through the loader. Querying all entry points doesn't sound too bad in theory. The problem is there's quite a few of them, so instead of typing this up manually, let's use some code generation to solve the problem!

Where to start?

If we want to auto-generate things, we first need to find something machine-readable which we can parse and use as the data source. Fortunately, the Vulkan specification is also available as an XML file, as part of the normal repository. Let's grab the vk.xml and see what we can do with it! This looks quite promising: We see all types in there, as well as all entry points. I'm going to use Python for the script we're about to write, and if you see something like ./types/type, that's XPath syntax to specify the path to the element(s) we're looking at. If you've never used XPath before, don't worry, we'll use very simple XPath only!
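If you haven't used ElementTree and XPath together before, here's a tiny, self-contained warm-up; the registry fragment is made up, but the query syntax is exactly what we'll use against vk.xml:

```python
from xml.etree import ElementTree

# A made-up miniature registry, just to demonstrate the XPath queries
doc = ElementTree.fromstring (
    '<registry>'
    '<types>'
    '<type category="handle"><name>VkDevice</name></type>'
    '<type category="basetype"><name>VkBool32</name></type>'
    '</types>'
    '</registry>')

# ./types/type selects every <type> element below <types>
allTypes = doc.findall ('./types/type')

# The bracket predicate filters on an attribute value
handles = doc.findall ('./types/type[@category="handle"]')

print (len (allTypes), len (handles))
```

That's all the XPath we need: child paths and attribute predicates.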

Our task is to find all functions that can be loaded using vkGetDeviceProcAddr, stuff them into a structure, and provide some method to query them off the device. Easy enough; let's type up some example code so we know what our result is supposed to look like:

#ifndef VK_DIRECT_4E2E4399D9394222B329DDA74C76DD869EC8B8359E3626DD5706CDEE595FCB2C
#define VK_DIRECT_4E2E4399D9394222B329DDA74C76DD869EC8B8359E3626DD5706CDEE595FCB2C 1

#include <vulkan/vulkan.h>

struct VkDirect
{
    using FT_vkAllocateMemory = VkResult (VkDevice device, const VkMemoryAllocateInfo* pAllocateInfo, const VkAllocationCallbacks* pAllocator, VkDeviceMemory* pMemory);
    FT_vkAllocateMemory* vkAllocateMemory = nullptr;

    using FT_vkFreeMemory = void (VkDevice device, VkDeviceMemory memory, const VkAllocationCallbacks* pAllocator);
    FT_vkFreeMemory* vkFreeMemory = nullptr;

    // many more functions here

    void Bind (VkDevice device)
    {
        vkAllocateMemory = (FT_vkAllocateMemory*)vkGetDeviceProcAddr (device, "vkAllocateMemory");
        vkFreeMemory = (FT_vkFreeMemory*)vkGetDeviceProcAddr (device, "vkFreeMemory");

        // many more functions here
    }
};

#endif

We see that we need a couple of things to succeed:

  • The functions which can be queried
  • The function signatures

Let's get started with getting the functions!

Getting the types

We want to use vkGetDeviceProcAddr, and according to its documentation, this function is only valid for specific types. Quoting the specification here:

The function pointer must only be called with a dispatchable object (the first parameter) that is device or a child of device.

All right, so we need to find all handle types which are somehow derived from VkDevice. Looking at the XML, we can see this bit:

<type category="handle" parent="VkDevice"><type>VK_DEFINE_HANDLE</type>(<name>VkQueue</name>)</type>
<type category="handle" parent="VkCommandPool"><type>VK_DEFINE_HANDLE</type>(<name>VkCommandBuffer</name>)</type>

That's quite close to what we want. The name element gives us the handle name, and we can then follow the parent chain until we arrive at VkDevice. If VkDevice is a parent, or the type itself is VkDevice, then the type matches our definition and should be included.

Unfortunately, there are a few problems: The parents are not necessarily in order in the XML (so we can't link while we parse), some objects have multiple parents, and there are also some alias types which don't have a parent at all! To solve this, we're going to build a dictionary mapping each type to the set of its parents, and at the end we're going to walk the parents recursively for every type. If any of the parents ends up being equal to VkDevice, we have a winner! Let's start typing:

from collections import OrderedDict

def FindDeviceDispatchableTypes (tree):
    # We search for all types where the category = handle
    handleTypes = tree.findall ('./types/type[@category="handle"]')

    # Ordered dict for determinism
    typeParents = OrderedDict ()

    # For each handle type, we store the type as the key, and the set of
    # its parents as the value
    for handleType in handleTypes:
        # If it's an alias, we just duplicate the parent set
        if 'alias' in handleType.attrib:
            name = handleType.get ('name')
            alias = handleType.get ('alias')

            # This assumes aliases come after the actual type,
            # which is true for vk.xml
            typeParents [name] = typeParents [alias]
        else:
            name = handleType.find ('name').text
            parent = handleType.get ('parent')

            # There can be more than one parent
            if parent:
                typeParents [name] = set (parent.split (','))
            else:
                typeParents [name] = set ()

    def IsVkDeviceOrDerivedFromVkDevice (handleType, typeParents):
        if handleType == 'VkDevice':
            return True

        parents = typeParents.get (handleType)
        if not parents:
            return False

        # If we derive from VkDevice through any path, we're set
        return any ([IsVkDeviceOrDerivedFromVkDevice (parent, typeParents) for parent in parents])

    deviceTypes = {t for t in typeParents.keys () if IsVkDeviceOrDerivedFromVkDevice (t, typeParents)}

    return deviceTypes
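
To convince ourselves the parent walk does the right thing, here's a condensed version run against a hand-written registry fragment. The element layout is simplified compared to the real vk.xml (no VK_DEFINE_HANDLE wrapper), but the parent relationships shown are the real ones:

```python
from xml.etree import ElementTree

# Simplified, hand-written registry fragment; parent relationships as in vk.xml
registry = ElementTree.fromstring ('''
<registry><types>
    <type category="handle"><name>VkInstance</name></type>
    <type category="handle" parent="VkInstance"><name>VkPhysicalDevice</name></type>
    <type category="handle" parent="VkPhysicalDevice"><name>VkDevice</name></type>
    <type category="handle" parent="VkDevice"><name>VkQueue</name></type>
    <type category="handle" parent="VkDevice"><name>VkCommandPool</name></type>
    <type category="handle" parent="VkCommandPool"><name>VkCommandBuffer</name></type>
</types></registry>''')

# Condensed version of the parent walk from above
typeParents = {}
for handleType in registry.findall ('./types/type[@category="handle"]'):
    parent = handleType.get ('parent')
    typeParents [handleType.find ('name').text] = set (parent.split (',')) if parent else set ()

def DerivesFromVkDevice (name):
    if name == 'VkDevice':
        return True
    return any (DerivesFromVkDevice (p) for p in typeParents.get (name, set ()))

deviceTypes = {t for t in typeParents if DerivesFromVkDevice (t)}
print (sorted (deviceTypes))
```

Note that VkPhysicalDevice and VkInstance are correctly excluded: their parent chains never reach VkDevice.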

We now have the set of handle types. The next step is finding the functions using those.

Device functions

Finding the functions could be really complicated if the dispatchable type could appear anywhere, as we'd have to check all parameters. Fortunately, Vulkan specifies that the dispatchable type always comes as the first argument, so we only have to check the first parameter; if it's in the set we just computed, we're done. We're going to iterate over all ./commands/command entries -- those are the entry points. They look as follows:

<command>
    <proto><type>VkResult</type> <name>vkAllocateMemory</name></proto>
    <param><type>VkDevice</type> <name>device</name></param>
    <param>const <type>VkMemoryAllocateInfo</type>* <name>pAllocateInfo</name></param>
    <param optional="true">const <type>VkAllocationCallbacks</type>* <name>pAllocator</name></param>
    <param><type>VkDeviceMemory</type>* <name>pMemory</name></param>
</command>

We can ignore most of that. What we need is the proto element, which contains the return type and the name, and then the first param element. To build the signature, we also have to flatten the parameters back into plain text. Everything else can be ignored. Let's wrap this into a function which returns the parsed data in an easy-to-digest list of dictionaries:

def FindAllDeviceFunctions (tree, deviceTypes):
    functions = []

    for command in tree.findall ('./commands/command'):
        parameters = command.findall ('param')
        if parameters:
            firstParameter = parameters [0]
            if firstParameter.find ('type').text in deviceTypes:
                function = {
                    'return_type' : command.find ('proto/type').text,
                    'name' : command.find ('proto/name').text,
                    'parameters' : []
                }

                for parameter in parameters:
                    # This flattens ``<param>const <type>T</type> <name>N</name></param>``
                    # to ``const T N``
                    function ['parameters'].append (''.join (parameter.itertext ()))

                functions.append (function)

    return functions
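
The flattening via itertext is worth a quick look in isolation. itertext yields all text fragments of an element in document order, so joining them recovers the plain-text parameter declaration:

```python
from xml.etree import ElementTree

# One param element as it appears in vk.xml
param = ElementTree.fromstring (
    '<param>const <type>VkMemoryAllocateInfo</type>* <name>pAllocateInfo</name></param>')

# itertext() walks the element and yields 'const ', 'VkMemoryAllocateInfo',
# '* ' and 'pAllocateInfo' in order; joining gives the full declaration
flattened = ''.join (param.itertext ())
print (flattened)
```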

You might think that's all we need to stamp them out, but there's one more thing we need to look at before we get going.

Handling #ifdef

If we just dump everything, we'll find out that it compiles fine on Windows (at least for 1.0.69), but on Linux, some entry points are not defined. Turns out, there's quite a few things protected by a platform #define. What we're going to do is to find all those entry points, and wrap them into an #ifdef block.

To find the protected bits, we have to look at the ./extensions element. They are structured as follows:

  • /extensions/extension[@protect] -- Each extension with protection has the protect attribute (which is selected using [@protect])
  • Extensions specify entry points in ./require/command

For example, here's one of those protected extensions:

<extension name="VK_KHR_external_memory_win32" number="74" type="device" requires="VK_KHR_external_memory" author="KHR" contact="James Jones @cubanismo" protect="VK_USE_PLATFORM_WIN32_KHR" supported="vulkan">
    <require>
        <!-- various fields omitted -->
        <command name="vkGetMemoryWin32HandleKHR"/>
        <command name="vkGetMemoryWin32HandlePropertiesKHR"/>
    </require>
</extension>

We'll just iterate over all extensions which have some protection, and then invert the index so we're storing the function name as the key, and the protections as the value:

def GetFunctionProtection (tree):
    extensions = tree.findall ('./extensions/extension[@protect]')

    result = {}

    for extension in extensions:
        protection = extension.get ('protect').split (',')
        for command in extension.findall ('./require/command[@name]'):
            result [command.get ('name')] = protection

    return result
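
Running the same inversion against a stripped-down extensions block (a hand-written fragment, not the real vk.xml) shows the mapping we end up with:

```python
from xml.etree import ElementTree

# Hand-written fragment mimicking a protected extension in vk.xml
registry = ElementTree.fromstring ('''
<registry><extensions>
    <extension name="VK_KHR_external_memory_win32" protect="VK_USE_PLATFORM_WIN32_KHR">
        <require>
            <command name="vkGetMemoryWin32HandleKHR"/>
        </require>
    </extension>
</extensions></registry>''')

# Invert the index: function name -> list of protection defines
protection = {}
for extension in registry.findall ('./extensions/extension[@protect]'):
    protect = extension.get ('protect').split (',')
    for command in extension.findall ('./require/command[@name]'):
        protection [command.get ('name')] = protect

print (protection)
```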

Combining it all

Now we have everything in place, and the only remaining bit is to generate the code. We iterate over the functions, creating the type definitions and fields first. Then we iterate a second time to fill out the Bind method. As a bonus, we take the stream to write into, so we can redirect easily into a file:

def GenerateHeader (tree, functions, protection, outputStream):
    import hashlib
    from xml.etree import ElementTree

    def Write (s=''):
        print (s, file=outputStream)

    # Same tree will always result in the same hash
    includeUuid = hashlib.sha256 (ElementTree.tostring (tree)).hexdigest ().upper ()

    Write (f'#ifndef VK_DIRECT_{includeUuid}')
    Write (f'#define VK_DIRECT_{includeUuid} 1')
    Write ()
    Write ('#include <vulkan/vulkan.h>')
    Write ()

    Write ('struct VkDirect')
    Write ('{')

    def UnpackFunction (function):
        return (function ['name'], function ['return_type'], function ['parameters'])

    for function in functions:
        name, return_type, parameters = UnpackFunction (function)

        # We call vkGetDeviceProcAddr itself through the loader, so skip it
        if name == 'vkGetDeviceProcAddr':
            continue

        protect = protection.get (name, None)

        if protect:
            Write ('#if ' + ' && '.join ([f'defined({p})' for p in protect]))

        Write (f'\tusing FT_{name} = {return_type} ({", ".join (parameters)});')
        Write (f'\tFT_{name}* {name} = nullptr;')
        if protect:
            Write ('#endif')
        Write ()

    Write ('\tvoid Bind (VkDevice device)')
    Write ('\t{')
    for function in functions:
        name, return_type, parameters = UnpackFunction (function)

        if name == 'vkGetDeviceProcAddr':
            continue

        protect = protection.get (name, None)

        if protect:
            Write ('#if ' + ' && '.join ([f'defined({p})' for p in protect]))

        Write (f'\t\t{name} = (FT_{name}*)vkGetDeviceProcAddr (device, "{name}");')
        if protect:
            Write ('#endif')

    Write ('\t}')
    Write ('};')
    Write ()
    Write ('#endif')

... and that's it for today. You can find the whole script here -- enjoy!