AMD releases "Barcelona" software optimization guide

AMD just released the software optimization guide for their K10 ("Barcelona") CPU line. You can get it from here. It seems the SSE units really are 128 bits wide (as expected). They improved instruction fetching to fill the wider execution units (this was already a bottleneck on the K8 architecture, nice to see it fixed). The decoders seem to be better, and some instructions are faster now. A "sideband stack optimizer" was added to speed up PUSH/POP instructions. Large-page support has been extended with 1 GiB pages. All in all it looks like an evolutionary overhaul; I'm curious to see it benchmarked.

Test driven

For the last half year, I've been reading a lot about "modern" software engineering approaches, like agile development practices, XP and test-driven development. Read on for a short story of how I work today.

This is how I write my code currently, and I'm pretty productive that way, but your mileage may vary.

Nowadays I usually work in short sprints, which I try to keep to around 2 hours. Before I start a sprint, I think about what I want to have at the end, usually some specific functionality. It's just a rough idea of what should work, no concept paper, though I tend to write up a short note before implementing a larger system, covering the main classes and their responsibilities.

At the beginning, I just write a skeleton class header and define the types I think I'm gonna need, plus the helper object definitions, just enough that it compiles. There is no code that does anything yet, so this is usually really quick (like 5 minutes or so). Note that I don't implement the functions, so calling them will result in a link error.

Then I start to write unit tests for the helper classes, one for each function, of course only if there are helper classes that I'm aware of beforehand (often, the helper classes appear during a later refactoring). This takes me around 10 minutes per function, including the unit test. If I end up using some other function, I get a linker error, so I can see whether I called the function properly or imagined a different interface. This happens most of the time: you have a function called doX in your interface, but in your test case you think it's called getQ. This is your chance to fix it, and most of the time the interface definition was the part that was wrong. I can then go ahead and implement the missing functions right away. I check in the code at each step: once after the interface, then after each test, then after implementing the function, and then after any bug fixing or test expanding.

I try to keep the tests simple, just a few lines; for most helper classes, they are 2-3 lines long at most. I try to test border cases, too: it's quite natural to have one test like "PopStack" and write a "PopEmptyStackThrows" right along with it to see what happens. Here, a crash costs you maybe a minute to fix; later, you could end up spending an hour tracking it down. And for some reason or another, it is fun to test such border cases :)

After the functionality is pretty much done, I do a refactoring pass. During the implementation, I often end up with duplicate code, and in this step, I try to clean it up as much as possible. With the unit tests in place, I can be pretty sure that I don't break stuff in this phase. I also add a few comments for stuff that is not obvious. Basically, the goal is to have solid and polished code that is suitable for further development, but it's not top-notch yet, as it has mainly been exercised by the unit tests and not by "real" code.

As soon as the helper classes are in place, I start with the main class. I try to focus on a single function, just as with the helper classes, and get it working. I try to have at least two tests for each function: one where it works, and one which shows what happens in case of an error. The latter is actually the more important one, as it is the one which makes the code robust; you tend to forget what is usually treated as an error. It's just the same as with the helper classes. I tend to go bottom-up by using only functionality that I have already tested. This usually means starting with the simplest functions first and working towards the complex ones. If I need a new helper function, I test it first before using it somewhere else, so I have less trouble if a test fails.

The last step, as soon as the whole class works, is yet another refactoring pass to consolidate functionality and extract utility functions. In this step, I usually polish the docs, document invariants, and add further checks for error cases (with tests). After this step, I consider the code production ready and move on: at this stage, every function is documented with pre/post conditions and covered by unit tests, the code has been refactored once or twice, and it is of pretty high quality.

In case I discover a bug later (by other code that depends on this class or functionality), I always write a test case for the affected class first (and check it in) before fixing it. I also look through all similar code to see if some other part is affected by the same bug (this is usually a sign that some refactoring is needed).

C++ Tricks, #3: Beating your compiler at stupid stuff

Sometimes you hear people say: "The compilers are all crap. Really pathetic what they do when you want them to do XYZ." Let's take a look at how much truth is in that. Please note that this is just a fun example; don't try this at home, kids. It took me around one hour to squeeze 20% performance out of this function. This is absolutely not worth it: invest that hour into a better algorithm and it'll give you much larger savings.

Rage against the machine

Let's take this function, the dot product.

double dot_c (const size_t size,
    double* __restrict a,
    double * __restrict b)
{
    double dot = 0;
    for (size_t i = 0; i < size; ++i)
    {
        dot += (a [i] * b [i]);
    }

    return dot;
}

This function being pretty stupid allows the compiler and the CPU to pull out all their tricks. The memory access pattern is so simple that even the dumbest prefetcher should understand it. There is no branch besides the loop condition. And it turns out that VC8 is even able to evaluate that function at compile time ;) (damn, that was a hard time to beat, I can tell ya).

Handmade ...

I'll skip over my first hand-made version, which was 170% slower than the C version above, and go directly to this hand-tuned masterpiece ( :) ). Look how simple and elegant it is ...

double dot_sse2 (const size_t size,
    double* __restrict a,
    double * __restrict b)
{
    __m128d acc = _mm_setzero_pd (), res = _mm_setzero_pd ();
    __m128d a1l = _mm_setzero_pd (),
        b1l = _mm_setzero_pd (),
        a1h = _mm_setzero_pd (),
        b1h = _mm_setzero_pd (),
        r1 = _mm_setzero_pd (),
        r2 = _mm_setzero_pd ();

    for (size_t i = 0, eights = size / 8; i < eights; ++i) {
        a1l = _mm_loadl_pd (a1l, a+8*i), b1l = _mm_loadl_pd (b1l, b+8*i);
        a1h = _mm_loadl_pd (a1h, a+8*i+1), b1h = _mm_loadl_pd (b1h, b+8*i+1);

        r1 = _mm_mul_sd (a1l, b1l);
        r2 = _mm_mul_sd (a1h, b1h);

        acc = _mm_add_sd (r1, r2);
        res = _mm_add_sd (res, acc);

        a1l = _mm_loadl_pd (a1l, a+8*i+2), b1l = _mm_loadl_pd (b1l, b+8*i+2);
        a1h = _mm_loadl_pd (a1h, a+8*i+3), b1h = _mm_loadl_pd (b1h, b+8*i+3);

        r1 = _mm_mul_sd (a1l, b1l);
        r2 = _mm_mul_sd (a1h, b1h);

        a1l = _mm_loadl_pd (a1l, a+8*i+4), b1l = _mm_loadl_pd (b1l, b+8*i+4);
        a1h = _mm_loadl_pd (a1h, a+8*i+5), b1h = _mm_loadl_pd (b1h, b+8*i+5);

        acc = _mm_add_sd (r1, r2);
        r1 = _mm_mul_sd (a1l, b1l);
        r2 = _mm_mul_sd (a1h, b1h);
        res = _mm_add_sd (res, acc);

        a1l = _mm_loadl_pd (a1l, a+8*i+6), b1l = _mm_loadl_pd (b1l, b+8*i+6);
        a1h = _mm_loadl_pd (a1h, a+8*i+7), b1h = _mm_loadl_pd (b1h, b+8*i+7);

        acc = _mm_add_sd (r1, r2);
        r1 = _mm_mul_sd (a1l, b1l);
        r2 = _mm_mul_sd (a1h, b1h);
        res = _mm_add_sd (res, acc);

        acc = _mm_add_sd (r1, r2);
        res = _mm_add_sd (res, acc);
    }

    // handle the remaining size % 8 elements at the tail of the arrays
    for (size_t i = 0, r = size % 8; i < r; ++i) {
        res.m128d_f64 [0] += a [size-1-i] * b [size-1-i];
    }

    return res.m128d_f64 [0];
}

Basically, the trick is to mix computation with memory accesses and to shuffle the instructions around enough that the dependencies can be satisfied during the memory accesses. The code makes use of all 8 XMM registers.

Benchmark!

There is only one result: the compiler's C version takes 1.25 sec at worst and 1.18 sec at best; my version takes 1.0 sec at worst and 0.9 sec at best. Read: I'm up to 35% faster! Tested on Vista x86 on a Turion at 1.8 GHz with 2000 iterations over a 65536-element vector.

Conclusion

While it is usually fun to do some low-level optimization, and it can even give some nice results, it's usually not worth the hassle. In 99.9% of real-world cases, the compiler will create much better code than you can ever hope to write. Unless a profiler tells you that a particular function is a huge bottleneck, don't even think about doing such low-level tuning, and even then, think about better algorithms first before doing this kind of stuff.

C++ tricks, #2: Parameter passing

From time to time I take a look at Aqsis, and recently I started a discussion about passing parameters by const reference instead of by value for integral types, under the assumption that it is just as fast (as the compiler notices that the type fits in a register and doesn't bother passing a pointer). I was quite sure I had seen this with VC7.1. However, Chris Foster wrote a small test case that showed there is indeed a difference, ranging from 1-2% with VC8 (and the generated ASM is different) over 5% (GCC 4) to 100% (GCC 3.4). So passing floats by const reference is no good if you want fast code (unless you're using Intel C++, which did optimize it away). If you are writing generic code, where you don't know how large your types are, there is a Boost-provided solution that I had totally forgotten (thanks Andrew J. Bromage for pointing it out) which selects the optimal type for parameter passing depending on whether the type is smaller than a pointer or not: call traits, which work great and do their job.