Post schedule, SSE, other stuff

Well, I know, I haven't posted for some weeks now, and I'm getting complaints about it, so here we go :)

Posting schedule

As you can easily see, my posting schedule is no longer really regular. I tried for some time to keep up at least one post per week; the problem is that I don't want to post about work-in-progress stuff. So as long as I'm working on something, I usually don't write about it until I've finished. The other kinds of posts I do on a case-by-case basis, usually based on user request. So if you want to read about a subject, drop me a line by mail or just comment somewhere, and I'll take a look.

Current work

That said, I'm currently working on some image processing tools. They're not finished yet, so no further details there. What might be more interesting is that I've been using SSE2 for part of this, and so far I'm very pleased with the results. SSE2 is a very good fit for image processing, although it does not support 16-bit floats. For 8-bit images, however, you can usually process 2 pixels à 4 channels at once. Why 8 elements only? Because many SSE2 operations on integers result in 16-bit integers, so you need pack/unpack, and given that you have only eight 16-bit slots per register, you can't hope to have space for more than 2 pixels.
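To make the unpack/compute/pack pattern concrete, here is a minimal sketch that scales two RGBA8 pixels by a fixed-point factor. The function name and the choice of operation are made up for illustration and are not taken from my actual tools; the intrinsics themselves are plain SSE2 from emmintrin.h.

```cpp
#include <emmintrin.h> // SSE2 intrinsics
#include <cstdint>

// Hypothetical example: scales 2 RGBA8 pixels (8 bytes) by factor / 256,
// with factor in [0, 256].
void ScaleTwoPixels (const std::uint8_t* src, std::uint8_t* dst, int factor)
{
    const __m128i zero = _mm_setzero_si128 ();

    // Load 8 bytes (2 pixels) into the low half of the register
    __m128i p = _mm_loadl_epi64 (reinterpret_cast<const __m128i*> (src));

    // Unpack: widen 8 x 8 bit to 8 x 16 bit, which fills the whole register
    p = _mm_unpacklo_epi8 (p, zero);

    // 16-bit multiply; 255 * 256 = 65280 still fits into 16 bits
    p = _mm_mullo_epi16 (p, _mm_set1_epi16 (static_cast<short> (factor)));
    p = _mm_srli_epi16 (p, 8);

    // Pack: narrow back to 8 bit with unsigned saturation and store 8 bytes
    p = _mm_packus_epi16 (p, zero);
    _mm_storel_epi64 (reinterpret_cast<__m128i*> (dst), p);
}
```

After the unpack, the two pixels occupy all eight 16-bit slots, which is exactly why there is no room for a third one.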

Here I also gotta say that I'm a happy user of compiler intrinsics. While some people avoid them like the plague, I've observed pretty good code generation so far, even when I intentionally used more registers than available so the compiler had to decide where to spill them. Moreover, as GCC and Intel C++ understand the very same intrinsics, I immediately get portability across x86 and x64 on Windows, Linux and Mac OS X, which is well worth the extra typing.

One word of warning: optimising with SSE2 takes a lot of time. First, you have to write a C baseline version, which must be maintained and tested as a fallback for CPUs which don't have SSE2. Once you have a correct and verified C version, you can start optimising it until you get the feeling the compiler should be able to make great SSE2 out of it. And then you gotta do the SSE2 translation, always looking at the compiler output to make sure it didn't rearrange stuff and such (with VC 2008 SP1, I didn't have a problem so far -- I didn't try the 2010 CTP yet due to a lack of time). Do it if your profiling tells you so, but don't do it just for fun.
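A rough sketch of how the baseline and the SSE2 version can live side by side, using a rounded average of two 8-bit images as a stand-in operation. All names here are invented for the example, and the SSE2 path is selected at compile time only; a real fallback for CPUs without SSE2 would need a runtime check, which I've left out.

```cpp
#include <cstddef>
#include <cstdint>

// Assumption: these macros are how GCC/Clang and Visual C++ signal SSE2 support
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#define HAS_SSE2 1
#include <emmintrin.h>
#else
#define HAS_SSE2 0
#endif

// C baseline: rounded average of two 8-bit images. Kept as the reference
// the SSE2 version is verified against.
void AverageBaseline (const std::uint8_t* a, const std::uint8_t* b,
                      std::uint8_t* out, std::size_t count)
{
    for (std::size_t i = 0; i < count; ++i) {
        out [i] = static_cast<std::uint8_t> ((a [i] + b [i] + 1) >> 1);
    }
}

// Optimised version: 16 bytes per iteration with SSE2, baseline for the rest
void Average (const std::uint8_t* a, const std::uint8_t* b,
              std::uint8_t* out, std::size_t count)
{
    std::size_t i = 0;
#if HAS_SSE2
    for (; i + 16 <= count; i += 16) {
        const __m128i va = _mm_loadu_si128 (reinterpret_cast<const __m128i*> (a + i));
        const __m128i vb = _mm_loadu_si128 (reinterpret_cast<const __m128i*> (b + i));
        // _mm_avg_epu8 computes (a + b + 1) >> 1 per byte, same as the baseline
        _mm_storeu_si128 (reinterpret_cast<__m128i*> (out + i), _mm_avg_epu8 (va, vb));
    }
#endif
    AverageBaseline (a + i, b + i, out + i, count - i);
}
```

Running both functions over the same input and comparing the output byte by byte is a cheap way to keep the two versions in sync.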

Compiler optimisations, abstraction penalty

Some more notes on writing fast code. First, don't assume a compiler will unroll all those loops over 3-4 elements; some don't, and sometimes this results in slightly slower code. Second, beware of passing around standard containers: sometimes the compiler cannot construct the container right at the target site, giving you a copy (which is more expensive than you think, as it requires at least one additional allocation, a memcpy, which can be not that fast, and a deallocation).
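One way to sidestep the container copy is to let the caller own the container and fill it in place. The following is only a sketch with made-up function names; compilers with reliable return value optimisation or C++11 move semantics make the first form cheap as well.

```cpp
#include <vector>

// Returns by value. If the compiler cannot construct the result directly in
// the caller's variable, this costs an allocation, a copy and a deallocation.
std::vector<int> MakeSquares (int count)
{
    std::vector<int> result;
    result.reserve (count);
    for (int i = 0; i < count; ++i) {
        result.push_back (i * i);
    }
    return result;
}

// Fills a vector owned by the caller, so no temporary container is involved
// and the existing allocation can be reused across calls.
void FillSquares (int count, std::vector<int>& out)
{
    out.clear ();
    out.reserve (count);
    for (int i = 0; i < count; ++i) {
        out.push_back (i * i);
    }
}
```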

Absolutely avoid indirection. I couldn't believe it, but even removing one level of indirection can give you 5-10% more speed. In my case, I replaced a call via a virtual function by storing the function pointer explicitly. For the virtual function call, the call chain is: look up the object, offset into the vtable, call the target entry. The more direct call is: jump to the memory address stored here. I was pretty surprised actually, as I assumed that both pointers would be sitting in the L1 cache and hence the access cost should be minimal.
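For illustration, the two call styles look roughly like this. The types and functions are hypothetical, not my actual code, and whether the second form is really faster depends on your compiler and CPU, so measure before you commit to it.

```cpp
// Virtual dispatch: load the vtable pointer from the object, load the
// function address from the vtable, then call it.
struct Filter
{
    virtual ~Filter () {}
    virtual int Apply (int value) const = 0;
};

struct Doubler : Filter
{
    int Apply (int value) const { return value * 2; }
};

int RunVirtual (const Filter& filter, int count)
{
    int sum = 0;
    for (int i = 0; i < count; ++i) {
        sum += filter.Apply (i);
    }
    return sum;
}

// Explicit function pointer: the address is stored right where it is used,
// which removes one memory lookup per call.
typedef int (*ApplyFn) (int);

int DoubleValue (int value) { return value * 2; }

int RunDirect (ApplyFn fn, int count)
{
    int sum = 0;
    for (int i = 0; i < count; ++i) {
        sum += fn (i);
    }
    return sum;
}
```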

Blog theme

Different subject: for some time now, I've been planning to give my blog a total overhaul. If you know a good page with blank templates for WordPress, I'd be happy to know about it, as I'm looking for a very clean template to start with.

Debugging shaders: Artifacts grouped into quads

If you're debugging a shader, and the wrong pixels come grouped into (screen-aligned) 2x2 blocks, don't look too long at the code at hand -- it might be that your gradients are wrong. Recently, I had a shader aliasing problem which I couldn't track down properly. Even after hours of debugging with PIX, the shader output was still wrong for some reason. It was not a floating-point precision problem, because even with Load (which uses integer coordinates and does no sampling), nothing changed.

I went so far as to compute the texture lookups on paper and perform the filtering by hand, just to check that the UVs and so on were right. Well, everything was right except the gradients, which were way off and resulted in some blurred pixels - always grouped into 2x2 blocks and aligned with the screen. As I didn't compute them explicitly, there was no easy way to visualize them, and hence I missed them for several hours straight.
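Why 2x2? As far as I understand it, the hardware shades pixels in screen-aligned 2x2 quads and derives the implicit gradients by differencing neighbouring pixels inside each quad. The following CPU-side sketch only models that idea; the names are invented and real GPUs differ in the details.

```cpp
// Any per-pixel quantity that is fed into a texture lookup, e.g. a U coordinate
typedef float (*PixelFn) (int x, int y);

struct Gradient
{
    float ddx;
    float ddy;
};

// Models implicit gradients: differences are taken between the pixels of the
// 2x2 quad that contains (x, y), never across a quad boundary.
Gradient QuadGradient (PixelFn f, int x, int y)
{
    const int qx = x & ~1; // left column of the quad
    const int qy = y & ~1; // top row of the quad

    Gradient g;
    g.ddx = f (qx + 1, y) - f (qx, y);
    g.ddy = f (x, qy + 1) - f (x, qy);
    return g;
}

// Hypothetical test input: a coordinate which varies linearly over the screen
float LinearU (int x, int y)
{
    return 0.25f * x + 0.5f * y;
}
```

If the quantity jumps inside a quad, all pixels of that quad see the bogus difference, which would explain errors that show up in whole 2x2 blocks.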

War story: Cache it if you can

The first post in what I hope becomes a series of war stories, right from the trenches. If you have some piece of code which pops up at the top of your profiler output, and you're about to show off your assembler programming skills to make this piece faster ... take a look one level up the call chain, maybe there's some place where you can cache the output.

Case in point: by caching a single number, I could double the performance of an application today, and before that, I was already really close to implementing a hash set in pure assembler ...
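What "caching a single number" can look like, as a minimal sketch: the class and the cached quantity are invented for this example and have nothing to do with the application in question.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical example: the sum is requested far more often than the values
// change, so it is computed once and reused until a value is modified.
class Samples
{
public:
    explicit Samples (const std::vector<int>& values)
        : values_ (values), sum_ (0), sumValid_ (false), computeCount_ (0)
    {
    }

    void Set (std::size_t index, int value)
    {
        values_ [index] = value;
        sumValid_ = false; // invalidate the cached number
    }

    long Sum () const
    {
        if (!sumValid_) {
            sum_ = std::accumulate (values_.begin (), values_.end (), 0L);
            sumValid_ = true;
            ++computeCount_;
        }
        return sum_;
    }

    // How often the expensive path actually ran
    int ComputeCount () const { return computeCount_; }

private:
    std::vector<int> values_;
    mutable long sum_;
    mutable bool sumValid_;
    mutable int computeCount_;
};
```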

C++ tricks, #6: Explicit template instantiation

Today we take a look at explicit template instantiations (yet another post which is a direct result of user feedback :) ). With explicit template instantiations, you can define a template and instantiate it in a DLL, so clients don't even have to see the implementation of the template. All you need is a language extension (extern template) which will become part of C++0x, and is currently supported by GCC, Intel and MSVC.

The DLL

Let's assume we have this great class:

template <typename T>
class Container
{
public:
    Container (int size);

    ~Container ();

    T Get (int index) const;

    void Set (int index, T value);

private:
    T* data_;
};

and we want to provide our clients with an instantiation which works with integers only. For this to work, we put the member definitions into a separate file (you'll see later why), container_impl.h:

template <typename T>
Container<T>::Container (int size)
{
    data_ = new T [size];
}

template <typename T>
Container<T>::~Container ()
{
    delete [] data_;
}

template <typename T>
T Container<T>::Get (int index) const
{
    return data_ [index];
}

template <typename T>
void Container<T>::Set (int index, T value)
{
    data_ [index] = value;
}

Then we include both in a file in our DLL, let's call it container_in.cpp like this:

#include "container.h"
#include "container_impl.h"

template class __declspec(dllexport) Container<int>;

Now clients can already use the integer container, by including only container.h which does not contain the definitions!

More flexibility

But what if clients want to use a float container? With the current solution, they'll get an error that the definition is not available. To get it working, the client has to include container_impl.h as well. But now Container<int> will be instantiated a second time in the client, which leads to an error. The workaround is to declare the instantiation as extern template, which means: don't generate code for this instantiation, even if the definition is available (for some reason or another, it also works without the extern in Visual C++ 9; I just tried). The finished header is:

#ifdef DLL_EXPORT // Define this when you build the DLL
#define API __declspec(dllexport)
#else
#define API __declspec(dllimport)
#endif

template <typename T>
class Container
{
// Container declaration
};

#ifndef IN_ // Note: For VC++, you can leave out the extern
extern template class API Container<int>;
#endif

The container_in.cpp file has to #define IN_ now, before it includes the header. The client works like this:

#include "../shared_lib/container.h"      // Declaration plus extern template
#include "../shared_lib/container_impl.h" // Definitions, needed for Container<float>

#include <iostream>

int main (int, char**)
{
    Container<float> c1 (5); // Will be instantiated here

    c1.Set (3, 13.37f);
    std::cout << c1.Get (3) << std::endl;

    Container<int> c2 (4); // Will be linked in from the DLL

    c2.Set (1, 4711);
    std::cout << c2.Get (1) << std::endl;
}
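If you want to play with the mechanism without setting up a DLL project first, the whole thing can be squeezed into a single file. This is just a sketch: the dllexport/dllimport part is dropped, and the comments mark which piece would live in which of the files from above.

```cpp
// --- container.h ---
template <typename T>
class Container
{
public:
    Container (int size);
    ~Container ();

    T Get (int index) const;
    void Set (int index, T value);

private:
    T* data_;
};

// What the client sees: Container<int> is instantiated somewhere else
extern template class Container<int>;

// --- container_impl.h ---
template <typename T>
Container<T>::Container (int size)
{
    data_ = new T [size];
}

template <typename T>
Container<T>::~Container ()
{
    delete [] data_;
}

template <typename T>
T Container<T>::Get (int index) const
{
    return data_ [index];
}

template <typename T>
void Container<T>::Set (int index, T value)
{
    data_ [index] = value;
}

// --- container_in.cpp ---
// The one and only explicit instantiation definition for Container<int>
template class Container<int>;
```

Container<float> still gets instantiated implicitly wherever it is used, while Container<int> comes from the single explicit instantiation.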

That's it!