Moving from Wordpress to Nikola

If you're a regular reader, you will have noticed that this blog looks remarkably different as of today. What happened? Well, I moved the blog from Wordpress to a static page generator -- Nikola. That might come as a surprise, as I've been using Wordpress for 11 years now, so let's dig into the reasons.

Quo vadis, Wordpress?

The most important one for me is that I often add code snippets to my blog posts, and the experience of adding code to Wordpress is horrible. I ended up writing my own plugin for Wordpress to teach it code highlighting the way I wanted, but even then, moving between the "visual" and the "text" editor would mess up source formatting. Also, I would frequently run into HTML escaping problems, where my code would end up being full of stray &gt; entities.

The other issue I had was backing up my blog. Sure, there's the Export as XML function, which does work, but it doesn't back up your images and other media, so you need to grab a dump of the wp-content folder in addition to the XML export. I'm not sure how many of you do this regularly -- for me, it was just a hassle.

Next up is theming. Wordpress used to be easy to theme when I started -- but over time, making a theme became more and more complicated. With every update, Wordpress would add minor tweaks which required upgrading the theme again. Eventually, I ended up using the sub-theming capabilities, but even then, theme development in Wordpress remained complicated. My "solution" was a local Docker installation of Wordpress and MySQL. That worked, but it was not very comfortable to use. Frankly, at that point, I lost all motivation to do theme-related changes.

The final nail in the coffin were encoding issues which left a lot of my posts with “ and other funny errors. Fixing this in Wordpress turned out to be a huge pain, especially as there is no easy way to bulk-process your posts, or for that matter, do anything with your posts.
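For what it's worth, this class of garbage is usually UTF-8 text that got decoded as Windows-1252 somewhere along the way, and it can often be repaired mechanically. Here's a minimal Python sketch of the idea (the ftfy library automates this far more robustly):

```python
# The classic encoding mishap: UTF-8 bytes re-decoded as cp1252.
garbled = "\u201c".encode("utf-8").decode("cp1252")
assert garbled == "â€œ"   # the left double quote turned into three characters

# The reverse round-trip repairs it -- the approach you'd use when
# bulk-fixing posts outside of Wordpress.
fixed = garbled.encode("cp1252").decode("utf-8")
assert fixed == "\u201c"
```

This only works when the mangling really was a single UTF-8/cp1252 round trip; doubly-mangled text needs more careful heuristics.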

Not everything is bad with Wordpress though. The visual editor is pretty slick, and when you stick to the WYSIWYG part of Wordpress, it is actually an awesome tool. It's just falling short on my use cases, and judging from the last few years of development, it's moving further into a direction which has very little added value for me. For instance, they recently added another REST-based web editor, so there are now two visual editors for Wordpress. Unfortunately, source code is still not a "solved" issue, nor is authoring files in something other than HTML.

Going static

All of those problems became big enough over time to make me bite the bullet and move to a static page generator. I do like static page generation -- I even wrote my own static page generator a couple of years back. What I didn't want to do, though, was write an importer for Wordpress, deal with auto-updating, theming, and so on, so I looked for a static page generator which would suit me. I'm partial to reStructuredText and Python, so I ended up with Nikola, a Python-based static page generator with first-class support for reStructuredText.

It comes with an import_wordpress command, which seemed easy enough, but it turns out you need a bit more post-processing before you can call it a day. Let's start with the importing!

Import & cleanup

Ingesting everything through import_wordpress will give you the content as HTML. Even though the files are called .md, they just contain plain HTML (which is valid Markdown ...). To convert them to "proper" Markdown, I used pandoc:

find . -name '*.md' | xargs -I {} pandoc --from html --to markdown_strict {} -o {}

That cleans up most of it, but you'll still end up with weird source code blocks. My source code was marked up with [source lang=""], so I had to go through all files containing source code and fix them up manually. Sounds like a lot of work, but it's usually quite straightforward, as you can just copy & paste from your existing page.

In retrospect, converting everything to reStructuredText might have been a better solution, but frankly, I don't care too much about the "old" content. For new content, I'm using reStructuredText; for old content -- I don't care.


Next up is redirecting your whole blog so your old links continue to work. I like to have "pretty" URLs; that is, for a post named /my-awesome-post, I want a URL like /blog/my-awesome-post. This means there has to be a /blog/my-awesome-post/index.html page. By default, however, the imported posts will end up as /posts/my-awesome-post.html. In order to solve this, you need to do two things:

  • Turn on pretty URLs using: PRETTY_URLS = True
  • Fix up the redirection table, which is stored in REDIRECTIONS in conf.py

To fix the redirections table, I used a small Python script to make sure that old URLs like /my-awesome-post were redirected to /blog/my-awesome-post -- I also used the chance to move all blog posts to a /blog subdirectory. Nikola will then generate /my-awesome-post/index.html with a redirection to the new URL.
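As an illustration, a sketch of how such a script might build the table (the slugs here are hypothetical; in practice you'd derive them from the imported posts, and REDIRECTIONS follows Nikola's (source path, target URL) tuple convention):

```python
# Hypothetical post slugs; with the real import you'd gather these from
# the imported files instead of hardcoding them.
slugs = ["my-awesome-post", "another-post"]

# For each old URL, Nikola writes <old path>/index.html containing a
# redirect to the new /blog/... location.
REDIRECTIONS = [
    ("{}/index.html".format(slug), "/blog/{}/".format(slug))
    for slug in slugs
]

print(REDIRECTIONS[0])
```

The exact trailing-slash handling depends on your PRETTY_URLS setting, so double-check a few generated redirect pages by hand.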


Finally, the comments -- I had a couple hundred in Wordpress, and Nikola, being a static page generator, has no notion of comments. The solution here is to import them into Disqus, which is straightforward: you create an account at Disqus, install the Disqus Wordpress plugin, and import your comments into Disqus. Be aware: this will take a while. Finally, you need to teach Disqus the new URLs. This is done using a URL remapping, a simple CSV file that contains the original URL and the new one. Again, same exercise as above -- you'll probably want to reuse the REDIRECTIONS for this and dump it out into a CSV.
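As a sketch, dumping such a redirection list into the two-column CSV could look like this (the domain is an assumption, and you should verify the exact format Disqus expects for your account):

```python
import csv
import io

SITE = "https://example.com"   # hypothetical domain
# A REDIRECTIONS-style list: (old relative path, new absolute path).
REDIRECTIONS = [("my-awesome-post/index.html", "/blog/my-awesome-post/")]

buf = io.StringIO()
writer = csv.writer(buf)
for old_path, new_url in REDIRECTIONS:
    # Disqus wants full URLs on both sides, one mapping per row.
    old_url = "{}/{}".format(SITE, old_path.replace("/index.html", "/"))
    writer.writerow([old_url, SITE + new_url])

print(buf.getvalue().strip())
```

Writing to a StringIO here just keeps the sketch self-contained; in practice you'd write straight to a file and upload it in the Disqus migration tools.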

Closing remarks

Voilà, there you go -- you've ported your blog from Wordpress to Nikola. The remaining steps you'll want to do:

  • Set up some revision control for your blog. I just imported it wholesale into Mercurial with the largefiles extension to store all attachments. Backups: Check!
  • Set up rsync to upload the blog. By its nature, Nikola generates all output files locally, and you need to synchronize them to your server -- some scripting will be handy for this.
  • Fix up all URLs to use / as the prefix. I just did a search-and-replace for everything which didn't continue with wp-content, redirected that to /blog/, and then fixed up the /wp-content ones.
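That search-and-replace can be sketched in a few lines (both the regex and /images/ as the new home for wp-content files are hypothetical; adjust them to your own layout):

```python
import re

# A sample line of imported post content.
html = '<a href="/my-awesome-post">post</a> <img src="/wp-content/uploads/pic.png">'

# Pass 1: send every root-relative link that is NOT wp-content to /blog/.
html = re.sub(r'(href|src)="/(?!wp-content)([^"]*)"', r'\1="/blog/\2"', html)

# Pass 2: fix up the /wp-content links (hypothetical new location).
html = html.replace('"/wp-content/', '"/images/')

print(html)
```

Run something like this over all imported files, but diff the result before committing -- regexes over HTML always deserve a second look.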

That's it -- let's see if Nikola will serve me for the next 11 years, just like Wordpress did :)

Designing C APIs in 2016

It's 2016, and C APIs are as popular as always. Many libraries are written in C, or provide C APIs, and there are tons of bindings for any language, making C the de facto standard for portable APIs. Yet, a lot of C APIs fail basic design guidelines, and there doesn't seem to have been much progress in recent years in the way we design those APIs. I've been working on a modern C API recently, and then there's also Vulkan with a few fresh design ideas. High time we take a look at what options we have when designing a C-based API in 2016!

Design matters

API design matters a lot. I've written about it before, and I still get to use a lot of APIs where I'd like to get onto a plane and have a serious chat with the author. Today we're not going to talk about the basic issues - ABI compatibility, sane versioning, error handling and the like - instead, we'll look at ways you can expose the API to the client.

My assumption here is you're designing an API which will live in a shared object/DLL. The basic approach to do this in C is to expose your API entry points directly. You just mark them as visible, decide on some naming scheme, and off you go. The client either links directly against your library, or loads the entry points manually, and calls them. This is practically how all C APIs have been designed so far. Look at sqlite, libpng, Win32, the Linux kernel - this is exactly the pattern.

Current problems

So what are the problems with this approach? Well, there's a couple:

  • API versioning
  • API loading
  • Extensibility

Let's tackle those one-by-one.

API versioning

For any API, you'll inevitably run into the issue that you're going to update a function signature. If you care about API and ABI compatibility, that means you need to add a new entry point into your API - the classic reason we see so many myFunctionEx or myFunctionV2. There's no way around this if you expose the entry points directly.

It also means you can't remove an entry point. Client applications can solve that issue if you provide a way to query the API version, but then we're going to run into the next problem - API loading.

In general, a direct C API has no really good way to solve this problem, as every version bump means either new entry points or more complicated loading. Adding a couple new entry points doesn't sound like a big issue, but over time, you'll accumulate lots of new versions and it'll become unclear for developers which one to use.

API loading

API loading covers the question of how a user gets started with your API. Often enough, you just link directly against an import library, and then you expect a shared object or DLL exporting the same symbols. This makes it hard to use the library dynamically (i.e., to load it only when needed). Sure, you can do lazy loading tricks using the linker, but what if you don't have the import library to start with? In this case, you'll end up loading some kind of dispatch library which loads all entry points of your API. This is for instance what the OpenCL loader does, or GLEW. This way, your client is 100% isolated from the library, but it's quite some boilerplate to write.

The solutions for this aim at reducing that boilerplate. GLEW generates all load functions from an XML description, OpenCL just mandates the clients expose a single entry point which fills out a dispatch table. Which brings us to the last topic, extensibility.


Extensibility

How do you extend your API? That is, how can someone add something like a validation layer on top of it? For most C APIs, extensions just mean more entry point loading, but layering is completely ignored.

Vulkan explicitly attacks the layering problem. The solution they came up with allows layers to be chained; that is, layers call into the underlying layers. To make this efficient, the chaining can skip several layers, so you don't pay per layer loaded, only per layer that actually handles an API call. Extensions are still handled the normal way, by querying more API entry points.

Vulkan also has a declarative API version stored in the vk.xml file, which contains all extensions, so they can generate the required function pointer definitions. This reduces the boilerplate a lot, but still requires users to query entry points - though it would be possible to autogenerate a full loader like GLEW does.

Dispatch & generation focused APIs

Thinking about the issues above, I figured that ideally, what we want is:

  • As few entry points as possible, ideally one. This solves the dynamic loading issue, and makes it easy to have one entry point per version.
  • A way to group all functions for one version together. Switching a version would then result in compile-time errors.
  • A way to layer a new set of functions on top of the original API - i.e. the possibility to replace individual entry points.

If you think C++ classes and COM, you're not far off. Let's take a look at the following approach to design an API:

  • You expose a single entry point, which returns the dispatch table for your API directly.
  • The dispatch table contains all entry points for your API.
  • You require clients to pass in the dispatch table or some object pointing to the dispatch table to all entry points.

So what would such an API look like? Here's an example:

typedef struct ImgApi ImgApi;

struct ImgApi
{
    int (*LoadPng) (ImgApi* api, const char* filename,
        Image* handle);
    int (*ReadPixels) (ImgApi* api, Image* handle,
        void* target);
    // or, if the Image handle can reach its dispatch table:
    // int (*ReadPixels) (Image* handle, void* target);

    // Various other entry points
};

// public entry points for V1_0
int CreateMyImgIOApiV1_0 (ImgApi** api);
int DestroyMyImgIOApiV1_0 (ImgApi* api);

Does this thing solve our issues? Let's check:

  • Few entry points - two. Yes, that works, for dynamic and static loading.
  • All functions grouped - check! We can add an ImgApiV2 without breaking older clients, and all changes become compile-time errors.
  • Layering - what do you know, also possible! We just instantiate a new ImgApi, and link it to the original one. In this case, the only difficulty arises from chaining through objects like Image, for which we'll need a way to query the dispatch table pointer from them.
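To make the layering idea concrete, here's a toy sketch in Python, with dictionaries standing in for the C function-pointer tables (the LoadPng name and the width field are hypothetical, mirroring the ImgApi example above):

```python
# Base implementation: the "real" dispatch table.
def make_base_api():
    def load_png(api, filename, handle):
        handle["width"] = 256   # pretend we decoded a 256-pixel-wide image
        return 0
    return {"LoadPng": load_png}

# A validation layer: same table shape, but it checks arguments and then
# forwards the call to the underlying table.
def make_validation_layer(next_api):
    def load_png(api, filename, handle):
        if filename is None or handle is None:
            return -1
        return next_api["LoadPng"](next_api, filename, handle)
    return {"LoadPng": load_png}

base = make_base_api()
api = make_validation_layer(base)

img = {}
assert api["LoadPng"](api, "test.png", img) == 0 and img["width"] == 256
assert api["LoadPng"](api, None, img) == -1   # rejected before reaching base
```

In C, the validation layer would be a struct whose first member is the dispatch table, so a pointer to the layer can be passed wherever an ImgApi* is expected.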

Looks like we got a clear winner here - and indeed, I recently implemented a library using such an API design and the actual implementation is really simple. In particular if you use C++ lambdas, you can fill out a lot of the redirection functions in-line, which is very neat. So what are the downsides? Basically, the fact that you need to call through the dispatch table is the only one I see. This will yield one more indirection, and it's a bit more typing.

Right now my thinking is that if you really need the utmost performance per-call, your API is probably too low-level to start with. Even then, you could still force clients to directly load that one entry point, or provide it from the dispatch table. The more typing issue is generally a non-issue: First of all, any kind of autocompletion will immediately identify what you're doing, and if you really need to, you can auto-generate a C++ class very easily which inlines all forwarding and is simply derived from the dispatch table.

This is also where the generation part comes into play: I think for any API going forward, a declarative description, be it XML, JSON or something else, is a necessity. There's so much you want to auto-generate for the sake of usability that you should be thinking about this from day one.

Right now, this design, combined with a way to generate the dispatch tables, looks to me like the way to go in 2016 and beyond. You get an easy to use API for clients, a lot of freedom to build things on top, while keeping all of the advantages of plain C APIs like portability.

Storing vertex data: To interleave or not to interleave?

Recently, I've been refactoring the geometry storage in my home framework. Among other things, I also looked into vertex attribute storage, which we're going to dive into today.

When it comes to storing vertex data, there's basically two different schools of thought. One says interleave the attributes, that is, store "fat" vertices which contain position, normal, UV coordinates and so on together. I'll refer to this as interleaved storage, as it interleaves all vertex attributes in memory. The other school says all attributes should remain separate, so a vertex consists of multiple streams. Each stream stores one attribute only with tight packing.

Why care?

Let's look at where vertex attribute storage matters:

  • On disk, as compression and read performance may be affected.
  • In memory, as some algorithms prefer one order or the other.
  • At render time, as it affects the required bandwidth and impacts performance on GPUs.


Rendering

We'll look at GPU rendering first, as it's the easiest to explain. On the GPU, all APIs allow sourcing vertex attributes either from multiple streams or from a single stream. This makes experiments very simple - and also highlights a few key differences.

The first thing that is affected is access flexibility. I have a geometry viewer, which may or may not have all attributes present for one mesh. With interleaved data, it's hard to turn off an attribute, as the vertex layout needs to be adjusted. With de-interleaved data, it's as easy as binding a null buffer or using a shader permutation which just skips the channel. One point for de-interleaved data.

The next use case is position-only rendering, which is very common for shadow maps. Again, de-interleaved data wins here, due to cache efficiency. It's quite easy to see - if you only need positions, you get the best cache and bandwidth utilization if you separate it from the other attributes. With interleaved data, every cache line fetches some other attributes which you throw away immediately. Another point for de-interleaved data.

Reading the position from vertices storing the various attributes interleaved. Only four vertices fit into a single cache line, and more than half of the cache line is discarded.

The last point is actually quite important for GPUs. On a GPU compute unit, you have very wide vector units which want to fetch all the same data in a given cycle, for instance, the position. If you have the data de-interleaved, they can fetch it into registers and evict the cache line immediately. You can see that in the figure above. In the first iteration, the red x coordinate is read, then y, and finally z. It takes thus three reads to consume a whole cache line, and it can be evicted right away. For interleaved data, the data has to remain in cache until everything has been read from it, polluting the already small caches - so de-interleaved data will render slightly faster due to better cache utilization.

Reading the position from a de-interleaved vertex buffer. The positions are stored tightly packed, and the positions of ten vertices can be processed for every cache line fetched from memory.
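The cache math behind the figures can be sketched in a few lines. The sizes are assumptions consistent with the figures: a 128-byte cache line, a 32-byte interleaved vertex (position, normal, and one more attribute, all float32), and a 12-byte position:

```python
CACHE_LINE = 128     # bytes per cache line (assumed)
VERTEX_SIZE = 32     # interleaved "fat" vertex (assumed)
POSITION_SIZE = 12   # three float32 components

# Interleaved: few whole vertices per line, and for a position-only pass
# most of the fetched bytes are thrown away.
vertices_per_line = CACHE_LINE // VERTEX_SIZE
useful_fraction = POSITION_SIZE / VERTEX_SIZE
print(vertices_per_line, useful_fraction)   # -> 4 0.375

# De-interleaved: positions are tightly packed, so nearly every fetched
# byte is useful.
positions_per_line = CACHE_LINE // POSITION_SIZE
print(positions_per_line)                   # -> 10
```

With these numbers, a position-only pass over interleaved data wastes roughly 62% of the fetched bandwidth, which is exactly the effect the figures illustrate.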

Is there actually a good reason to use interleaved data for rendering? I can't think of one, and as it turns out, I already switched my geometry viewers to de-interleaved data a few years ago and never looked back :)

In the offline rendering world, attributes also have been long specified separately as a ray-tracer mostly cares about positions. For this use case, cache efficiency is most important, so you want to have them separate as well, even on the CPU.


Processing

Here's where it gets more interesting. During the recent refactoring, I changed the mesh view abstraction to take advantage of de-interleaved data when fetching a single attribute. All the algorithms I had in place needed to be refactored to work with both interleaved and de-interleaved data, giving me a good idea of the advantages and disadvantages of each.

Turns out, there's only one algorithm in my toolbox which needs interleaved data so badly for performance that it will re-interleave things if it encounters a de-interleaved mesh. This algorithm is the re-indexer, which searches for unique vertices by storing a hash of each vertex along with a pointer to it, so it can do exact comparisons.
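A minimal sketch of what such a re-indexer does, with Python tuples standing in for fat vertices (the real implementation hashes and compares raw vertex memory, which is why it wants all attributes interleaved in one place):

```python
def reindex(vertices):
    """Collapse duplicate vertices into a unique list plus an index buffer."""
    unique = []
    index_of = {}   # exact comparison via dict keys, standing in for hash+pointer
    indices = []
    for v in vertices:
        if v not in index_of:
            index_of[v] = len(unique)
            unique.append(v)
        indices.append(index_of[v])
    return unique, indices

# Two duplicated vertices collapse into two unique ones plus indices.
unique, indices = reindex([(0, 0, 0), (1, 0, 0), (0, 0, 0), (1, 0, 0)])
print(unique, indices)   # -> [(0, 0, 0), (1, 0, 0)] [0, 1, 0, 1]
```

With de-interleaved input, "the vertex" is scattered across several streams, so the comparison key has to be gathered first -- which is exactly the re-interleaving step described above.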

Except for that algorithm, all others were working on one attribute only to start with, mostly position, and will be now slightly more cache efficient for de-interleaved data. I briefly measured performance, but it turns out, for "slim" vertices with position, normal and maybe one or two more attributes, the cache efficiency differences on CPUs are rather minimal - I'd expect more gains with heavy multi-threading and in bandwidth-restricted cases. The good news is that nothing got slower.

I'd call it a tie, due to the re-indexer. As I expose a pointer and stride to all algorithms now, it's basically trivial to swap between the representations. For the re-indexer, I'm thinking that there must be a better way to represent a vertex than a pointer and the hash, which would also resolve that issue (maybe a stronger hash which does not collide will be enough ...)


Storage

Here comes the surprising part. My geometry storage is LZ4 compressed, and with compression, you'd expect interleaved data to lose big time against non-interleaved. After all, all positions will have a similar exponent, all normals will have the same exponent, and so on; if they are stored consecutively, a compressor should find more correlation in the data.

Turns out, with the default LZ4 compression, this is not quite true, and interleaved data actually compresses quite a bit better. For testing, I used the XYZRGB Asian dragon, and converted it to my binary format which stores position as 3 floats, and normals as 3 floats as well.

Storage        No Idx/Compressed  Idx/Compressed  Idx/Compressed (HC)
Interleaved    169 MiB            138 MiB         135 MiB
Deinterleaved  189 MiB            138 MiB         132 MiB

It seems that LZ4 is actually able to get better compression from interleaved data, which duplicates whole vertices and not just a single attribute. With indexed data, it's a wash, and only with the high-compression setting does the de-interleaved data pull ahead.
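This kind of experiment is easy to reproduce. Here's a hedged sketch using zlib from the standard library as a stand-in for LZ4, and small synthetic data rather than the dragon, so the absolute numbers won't match the table above:

```python
import struct
import zlib

# Synthetic mesh: positions and normals, 3 float32 each (stand-in data).
n = 1000
positions = [(i * 0.1, i * 0.2, i * 0.3) for i in range(n)]
normals = [(0.0, 1.0, 0.0)] * n   # flat shading, deliberately redundant

# Interleaved: position and normal packed together per vertex.
interleaved = b"".join(
    struct.pack("<6f", *p, *nrm) for p, nrm in zip(positions, normals))

# De-interleaved: all positions first, then all normals.
deinterleaved = (
    b"".join(struct.pack("<3f", *p) for p in positions) +
    b"".join(struct.pack("<3f", *nrm) for nrm in normals))

assert len(interleaved) == len(deinterleaved) == n * 24

sizes = {
    "interleaved": len(zlib.compress(interleaved)),
    "deinterleaved": len(zlib.compress(deinterleaved)),
}
print(sizes)
```

Which layout wins depends heavily on the data and the compressor's window and match-finding strategy, which is presumably what the LZ4 results above are showing.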

This is actually really surprising to me, and it looks like more analysis is warranted here. One thing that did improve is loading times, as I no longer need to de-interleave for rendering, but the difference is just a couple of percent. This is mostly because I bulk-load everything into memory, which dominates the I/O time.

So on the storage side, it's one point for de-interleaved data in terms of performance, but one point for interleaved data for basic compression. I guess we can call it a tie!


Overall, the advantages of a fully de-interleaved pipeline outweigh the disadvantages I found on the storage and algorithmic front. As mentioned, except for one algorithm, everything got slightly faster, and storage space is cheap enough that I don't care about the few percent of bloat in the general case. For archival storage, I even get some benefit with de-interleaved data, so de-interleaved it is :)

Debugging D3D12 fences & queues

Welcome to a hands-on session with DirectX 12. I was recently made aware by Christian of a synchronization problem in my D3D12 sample which required multiple tries to fix (thanks again for reporting this!). The more interesting part is however how to find it without doing a very close code review like Christian did, but by using some tools.

The setup

If you want to follow along, make sure to check out the repository at revision 131a28cf0af5. I don't want to give away too much in one go, so we'll assume right now there is some synchronization issue and we'll debug it step-by-step. Let's start with taking a look using the Visual Studio Graphics Diagnostics. For this, you need to install the Graphics Tools in Windows 10 -- Visual Studio should prompt you to get them when you start the graphics debugging.

Without further ado, let's start the GPU usage analysis. You can find it under "Debug", "Start diagnostic tools without debugging", "GPU Usage". After the application ends, you should see something like this:


Let's select a second or so and use the "view details" button on this. The view you'll get should be roughly similar to the output below.


That's a lot of things going on. To find our application, just click on one of the entries in the table below, and you should find which blocks belong to our application. In my case, I get something like this:


Ok, so what do we see here? Well, the CPU starts after the GPU finishes, with some delay. Also, the GPU 3D queue is very empty, which is not surprising as my GPU is not really taxed with rendering a single triangle :) Due to the fact that we're running VSync'ed, we'd expect to be waiting for the last queued frame to finish before the CPU can queue another frame.

Let's try to look at the very first frame:


Looks like the CPU side is only tracked after the first submission, but what is suspicious is that the GPU frame time looks like a single frame was rendered before the CPU was invoked again. We'd expect the CPU side to queue up three frames though, so the first frame time should be actually three times as long. Can we get a better understanding of what's happening?


Yes, we can, but we'll need another tool for this - GPUView. GPUView is a front-end for ETW, the built-in Windows event tracing, and it hasn't gotten much love. To get it, you need to install the "Windows Performance Toolkit". Also, if you use a non-US locale, you need to prepare a user account with en_US as the locale, or it won't work. Assuming you have everything ready, here's the one-minute guide to using it:

  1. Fire up an administrator command prompt
  2. Go to C:\Program Files (x86)\Windows Kits\10\Windows Performance Toolkit\gpuview
  3. Run your application
  4. Type in log m, then Alt+Tab to your application
  5. Let it run for a second or two, Alt+Tab back, and type log
  6. Run GPUView on the Merged.etl file.

Just like in the Visual Studio graphics analysis tool, you'll need to select a few milliseconds worth of time before you can make any use of the output. I zoomed in on three frames here.


Notice the color coding for each application is random, so here my sample got dark purple. We can see it executing on the 3D queue, and at the bottom, we see the CPU submission queue.

You'll notice that suspiciously, just while the GPU is busy, the CPU queue is completely empty. That doesn't seem right - we should have several frames queued up, and the moment the GPU starts working (this is right after the VSync, after all!), we should be queuing up another frame.

Let's take a look at the present function. Conceptually, it does:

  1. Call present
  2. Advance to the next buffer
  3. Signal a fence for the current buffer

At the next frame start, we'll wait for the buffer associated with the current queue slot, which happens to be the slot we just used! This means we're waiting for the last frame to finish before we issue a new one, draining the CPU queue, and that's what we see in the GPUView output. Problem found! Fortunately, it's a simple one, as the only thing we need to change is to wait for the right fence. Let's fix this (and also the initial fence values, while we're at it) and check again with GPUView.
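The fix can be illustrated with a small sketch (plain Python standing in for the D3D12 fence API; the slot count and fence bookkeeping are deliberately simplified): each swap chain slot remembers the fence value of the frame that last used it, and we wait on that value, not on the frame we just submitted.

```python
# Hedged simulation: 3 queued frames, fences are plain integers, and "the
# GPU" is assumed to catch up whenever we decide to wait.
class FrameQueue:
    def __init__(self, slots=3):
        self.fence_values = [0] * slots  # fence value per swap chain slot
        self.current = 0                 # slot the CPU is about to reuse
        self.completed = 0               # highest fence value the GPU reached
        self.next_value = 1

    def wait_for_slot(self):
        # Wait for the frame that *last used this slot* (the oldest queued
        # frame) -- the bug was waiting on the frame just submitted instead.
        target = self.fence_values[self.current]
        must_wait = target > self.completed
        if must_wait:
            self.completed = target      # simulate the GPU catching up
        return must_wait

    def present(self):
        # Signal the fence for this slot, then advance to the next buffer.
        self.fence_values[self.current] = self.next_value
        self.next_value += 1
        self.current = (self.current + 1) % len(self.fence_values)

q = FrameQueue()
waits = []
for _ in range(4):
    waits.append(q.wait_for_slot())
    q.present()
print(waits)   # -> [False, False, False, True]: only frame 4 has to wait
```

With three slots, the CPU can queue three frames before it ever blocks, which is exactly the full-queue behavior we want to see in GPUView.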


Looks better, we see a present packet queued and some data after it. Let's zoom really close on what happens during the rendering.


What do we have here? Two present packets queued up, while the GPU is processing the frame. Here we can also see how long it takes to queue up and submit the data to the GPU. Notice that the total time span we're looking at here is in the order of 0.5 ms!

So finally, we fixed the problem and verified the GPU is no longer going idle but instead, the CPU queue is always nicely filled. While in this example, we're limited by VSync, in general you always want to keep the GPU 100% busy which requires you to have one more frame worth of work queued up. Otherwise, the GPU will wait for the CPU and vice versa, and even a wait of 1 ms on a modern GPU is something in the order of 10 billion FLOPs wasted (in my example, on an AMD Fury X, we're talking about 8601600000 FLOPs per ms!) That's a lot of compute power you really want to throw at your frame :)

5 years of data processing: Lessons learned

During my thesis work, I had to process lots of data. Many meshes I worked on contained hundreds of millions of triangles, and the intermediate and generated outputs would typically range in the tens to hundreds of GiB. All of this means that I had to spend a significant amount of time on "infrastructure" code to ensure that data processing remained fast, reliable and robust.

My work also required me to create many different tools for specific tasks. Over the years, this led to additional challenges in the area of tool creation and tool discovery. In this blog post, I'll take a look at the evolution of my tool infrastructure over roughly five years, and the lessons I learned.

Processing overview

Data processing includes tasks like converting large data sets into more useful formats, cleaning up or extracting data, generating new data, and modifying existing data. We can group the tools into two broad categories: the first reads an input and generates an output, while the second mutates existing data in some way. A typical example of category one is a mesh converter; for category two, think for instance of a tool which computes smooth vertex normals.

Why is it important to make that distinction? Well, in the early days, I did have two kinds of tools. Those in the second category would typically read and write the same file, while those in the first category had well defined inputs and outputs. The idea was that tools which fall into the second category would wind up being more efficient by working in-place. For instance, a tool which computes a level-of-detail simplification of a voxel mesh simply added the level-of-detail data into the original file (as the tool which consumed the data would eventually expect everything to be in a single file.)

Mutating data

Having tools which mutate files led to all sorts of problems. The main problem I ran into was the inability to chain tools, and the fact that I would often have to regenerate files to undo a mutation. Case in point: the level-of-detail routine would sometimes create wrong blocks, and those couldn't easily be fixed by re-running the tool with a special "replace" flag. Instead, I had to wipe all level-of-detail data first, followed by re-running the tool again. And that was after I had fixed all the bugs which would damage or replace the source data.

Towards functional tools

Over the years, I refactored and rewrote all tools to be side-effect free. That is, they treat their input data as read-only and write one or more outputs. Turns out, this made one optimization mandatory for all file formats I used: support for seek-free, or at least seek-minimal, reading. As I mentioned before, the original reason for mutating data in place was performance. By writing into the same file, I could avoid copying over data, which took a long time for the large data sets I had to work with.

The way I solved this was to ensure that all file formats could be read and written with near-perfect streaming access patterns. Rewriting a file would then be just as fast as copying, and it also made processing faster in many cases, to the point that "in-place" mutation was no longer worth it. The biggest offender was the level-of-detail creation, which previously wrote into the same file. Now it writes the level-of-detail data into a separate file, and if I want it all together again, I merge the files -- which only became practical once the read/write speed was close to peak disk I/O rates.

At the end, the changes to the file formats to make them "stream-aware" turned out to be quite small. For some things like the geometry streams, they were streams to start with, and for the voxel storage which was basically a filesystem-in-a-file all functions were modified to return entries in disk-offset order. For many clients, this change was totally transparent and immediately improved throughput close to theoretical limits.

Tool creation & discovery

After several years, a big problem I ran into was tool discovery. I had dozens of command-line tools, with several commands each and with lots of command-line options. Figuring out which ones I had and how to use them became an increasingly complicated memory-game. It also increased the time until other users would become productive with the framework as tools were scattered around in the code base. I tried to document them in my framework documentation, but that documentation would rarely match the actual tool. The key issue was that the documentation was in a separate file.

Similarly, creating a new tool would mean to create a new project, add a new command, parse the command-line and call a bunch of functions. Those functions were in the tool binary and could not be easily reused. Moving them over to libraries wasn't an option either, as these functions were typical library consumers and very high-level. And finally, even if I had them all as functions in a library, I would still need a way to find them.

The solution was to implement a new way for tool creation which also solved the tool discovery problem. This turned out to be an exercise in infrastructure work. The key problem was to balance the amount of overhead such that the creation of a tool doesn't become too complicated, but I still get the benefits from the infrastructure.

What I ended up with leverages a lot of my framework's "high-level" object classes, run-time reflection and least-overhead coding. Let's look at the ingredients one by one: In my framework, there's a notion of an IObject quite similar to Java or C#, with boxing/unboxing of primitive types. If I could somehow restructure all tool inputs & outputs to fit into the object class hierarchy, I would be able to use all of the reflection I already had in place. It turns out that because the tools are called infrequently, and because inputs are typically files, strings, numbers or arrays, moving to a class-based, reflection-friendly approach wasn't too hard.

Now I just had to solve the problem of how to make a tool easy to discover. For each tool, I need to store some documentation alongside it. Storing the tool description and documentation separately had already turned out to be a failure. The solution I ended up with was to embed the declarative part as SJSON right into the source file.

Let's take a look at a full source file for a tool which calls a vertex-cache index optimizer for a chunk:

#include "OptimizeIndices.h"

#include "niven.Geometry.VertexCacheOptimizer.h"

namespace niven {
struct OptimizeIndicesProcessor final : public IGeometryStreamProcessor
{
    OptimizeIndicesProcessor ()
    {
    }

    bool ProcessChunkImpl (const GeometryStream::Chunk& input,
        GeometryStream::Chunk& output) const
    {
        if (input.GetInfo ().HasIndices ()) {
            const int indexCount = input.GetInfo ().GetIndexCount ();

            HeapArray<int32> indices (indexCount);
            std::copy (
                input.GetIndexDataArrayRef ().begin (),
                input.GetIndexDataArrayRef ().end (),
                indices.Get ());

            Geometry::OptimizeVertexCache (MutableArrayRef<int32> (indices));

            output = input;

            output.SetIndexData (indices);
        } else {
            output = input;
        }

        return true;
    }
};

name = "OptimizeIndices",
flags = ["None"],
ui = {
    name = "Optimize indices",
    description =
[=[# Optimize indices

Optimizes the indices of an indexed mesh for better vertex cache usage. The input mesh must be already indexed.]=]
},
inputs = {
    "Input" = {
        type = "Stream",
        ui = {
            name = "Input file",
            description = "Input file."
        }
    },
    "Threads" = {
        type = "Int",
        ui = {
            name = "Threads",
            description = "Number of threads to use for processing."
        },
        default = 1
    }
},
outputs = {
    "Output" = {
        type = "Stream",
        ui = {
            name = "Output file",
            description = "Output file."
        }
    }
}

bool OptimizeIndices::ProcessImpl (const Build::ItemContainer& input,
    Build::ItemContainer& output,
    Build::IBuildContext& context)
{
    const OptimizeIndicesProcessor processor;

    return ProcessGeometryStream (input, output, processor, context);
}
} // namespace niven

There's a tiny boilerplate header for this which declares the methods, but it's otherwise empty. What do we notice? First, all inputs & outputs are specified right next to the source code using them. In this case, the ProcessGeometryStream method fetches the input and output streams from the input and output containers. All of this is type safe, as the declarative types are converted into types used within my framework, and all queries specify the exact type.

It would also be possible to auto-generate a class which fetches the inputs and casts them to the right type, but that never became enough of a problem. This setup -- with documentation and code integrated -- is what I call "least-overhead" coding. Sure, there is still some overhead: setting up a build tool takes slightly more code than a command-line tool which parses its parameters directly, but the overhead is extremely small -- some declarative structure and that's it. In fact, some tools became smaller, because loading files into streams and error handling is now done by the build tool framework.

One interesting tidbit is that the tool specifies an IStream -- not a concrete implementation. This means I can use, for instance, a memory-backed stream when I compose tools, or read and write files when the tool is started stand-alone. Previously, the command-line tools could only be composed through files, if at all.

On the other hand, I get the benefits of a common infrastructure. For instance, tool discovery is now possible in different formats:

Command-line tool discovery: the tool help is auto-generated from the declaration.
It's also easy to write a GUI doing the same. The description format uses Markdown, so it can be easily formatted as HTML.
Finally, the tool inputs can be used to automatically create widgets and full-blown UIs.


In hindsight, all of this looks quite obvious -- which is good, as it means the new system is easy to explain. During development, however, it was a long evolutionary process. At the beginning, I tried to keep things as simple as possible, with as few libraries, executables and boilerplate as possible. Over time, other parts of the framework evolved as well (in particular, the boxing of primitive types which integrated them into the common class hierarchy came pretty late), which affected design decisions. Towards the end, I also took more and more advantage of the fact that the tool code was an integral part of my framework.

By tying it closer to the rest of the code base, I could drastically cut down the special-case code in the tool library and reap lots of benefits. The downside is that extracting a tool from the framework is now very hard and would require a lot of work. This is the key tradeoff -- going "all-in" on a framework may mean you have to live inside it. Done correctly, you can get a lot out of this, and I'm leaning more and more towards more infrastructure on projects. Good infrastructure to "plug" into is where large frameworks like Qt or Unreal Engine 4 shine, even if it means a steeper learning curve and more overhead at the beginning. The key in such an evolution is to strive for the simple and obvious, though, and not to introduce complexity for its own sake.

The other key decision -- moving towards state-less, functional building blocks -- turned out to be another big winner in the end. The disadvantages in terms of disk usage and sometimes I/O time were more than offset by testability, robustness and the ability to compose full processing pipelines with ease.