# C++ Tricks, #3: Beating your compiler at stupid stuff

Sometimes you hear people say: “Compilers are all crap. Really pathetic what they do when you want them to do XYZ.” Let’s take a look at how much truth there is in that. Please note that this is just a fun example; don’t try this at home, kids. It took me around *one* hour to squeeze 20% more performance out of this function. This is absolutely *not* worth it: invest that hour into a better algorithm and it’ll give you much larger savings.

## Rage against the machine

Let’s take this function, the dot product.

```
#include <cstddef> // size_t

double dot_c (const size_t size,
              double* __restrict a,
              double* __restrict b)
{
    double dot = 0;
    for (size_t i = 0; i < size; ++i)
    {
        dot += a [i] * b [i];
    }
    return dot;
}
```

This function being so simple allows the compiler and the CPU to leverage all their tricks. The memory access pattern is so simple even the dumbest prefetcher should understand it, and there is no `if` besides the loop condition. It also turns out that VC8 is even able to evaluate that function at compile time ;) (damn, that was a hard time to beat, I can tell ya).
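If you want to reproduce such a benchmark yourself, you have to stop the compiler from folding the whole call into a constant. One common way is to route the result through a `volatile`, which the optimizer can’t see through. A minimal sketch; the `benchmark_once` wrapper is my own scaffolding, not the original test code:

```cpp
#include <cstddef>
#include <vector>

double dot_c (std::size_t size, const double* a, const double* b)
{
    double dot = 0;
    for (std::size_t i = 0; i < size; ++i)
        dot += a [i] * b [i];
    return dot;
}

// Writing the result into a volatile forces the compiler to actually
// perform the call at run time instead of pre-computing the answer.
double benchmark_once (const std::vector<double>& a,
                       const std::vector<double>& b)
{
    volatile double sink = dot_c (a.size (), a.data (), b.data ());
    return sink;
}
```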

## Handmade …

I’ll skip over my first hand-made version, which was 170% slower than the C version above, and go directly to this hand-tuned masterpiece ( :) ). Look how simple and elegant it is …

```
#include <emmintrin.h> // SSE2 intrinsics

double dot_sse2 (const size_t size,
                 double* __restrict a,
                 double* __restrict b)
{
    __m128d acc = _mm_setzero_pd (), res = _mm_setzero_pd ();
    __m128d a1l = _mm_setzero_pd (),
            b1l = _mm_setzero_pd (),
            a1h = _mm_setzero_pd (),
            b1h = _mm_setzero_pd (),
            r1  = _mm_setzero_pd (),
            r2  = _mm_setzero_pd ();
    for (size_t i = 0, eights = size / 8; i < eights; ++i)
    {
        // Interleave loads, multiplies and adds so that the arithmetic
        // of one pair can execute while the next pair is still loading.
        a1l = _mm_loadl_pd (a1l, a+8*i),   b1l = _mm_loadl_pd (b1l, b+8*i);
        a1h = _mm_loadl_pd (a1h, a+8*i+1), b1h = _mm_loadl_pd (b1h, b+8*i+1);
        r1  = _mm_mul_sd (a1l, b1l);
        r2  = _mm_mul_sd (a1h, b1h);
        acc = _mm_add_sd (r1, r2);
        res = _mm_add_sd (res, acc);
        a1l = _mm_loadl_pd (a1l, a+8*i+2), b1l = _mm_loadl_pd (b1l, b+8*i+2);
        a1h = _mm_loadl_pd (a1h, a+8*i+3), b1h = _mm_loadl_pd (b1h, b+8*i+3);
        r1  = _mm_mul_sd (a1l, b1l);
        r2  = _mm_mul_sd (a1h, b1h);
        a1l = _mm_loadl_pd (a1l, a+8*i+4), b1l = _mm_loadl_pd (b1l, b+8*i+4);
        a1h = _mm_loadl_pd (a1h, a+8*i+5), b1h = _mm_loadl_pd (b1h, b+8*i+5);
        acc = _mm_add_sd (r1, r2);
        r1  = _mm_mul_sd (a1l, b1l);
        r2  = _mm_mul_sd (a1h, b1h);
        res = _mm_add_sd (res, acc);
        a1l = _mm_loadl_pd (a1l, a+8*i+6), b1l = _mm_loadl_pd (b1l, b+8*i+6);
        a1h = _mm_loadl_pd (a1h, a+8*i+7), b1h = _mm_loadl_pd (b1h, b+8*i+7);
        acc = _mm_add_sd (r1, r2);
        r1  = _mm_mul_sd (a1l, b1l);
        r2  = _mm_mul_sd (a1h, b1h);
        res = _mm_add_sd (res, acc);
        acc = _mm_add_sd (r1, r2);
        res = _mm_add_sd (res, acc);
    }
    // Scalar tail for the last size % 8 elements. (m128d_f64 is a
    // Visual C++ extension for accessing the lanes of an __m128d.)
    for (size_t i = 0, r = size % 8; i < r; ++i)
    {
        res.m128d_f64 [0] += a [size - r + i] * b [size - r + i];
    }
    return res.m128d_f64 [0];
}
```

Basically, the trick is to mix computation with memory access and shuffle the instructions around enough that the dependency chains can be satisfied while the loads are still in flight. It makes use of all 8 XMM registers.
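The same independence trick can also be expressed with packed SSE2 arithmetic, which multiplies two doubles per instruction. The following is a sketch of the general idea, not the code benchmarked here; `dot_packed`, the unaligned loads, and the two-accumulator split are my own choices:

```cpp
#include <emmintrin.h> // SSE2 intrinsics
#include <cstddef>

double dot_packed (std::size_t size, const double* a, const double* b)
{
    __m128d acc0 = _mm_setzero_pd ();
    __m128d acc1 = _mm_setzero_pd ();
    std::size_t i = 0;
    // Two independent accumulators break the add-after-add dependency
    // chain, so both multiply/add pairs can be in flight at once.
    for (; i + 4 <= size; i += 4)
    {
        acc0 = _mm_add_pd (acc0, _mm_mul_pd (_mm_loadu_pd (a + i),
                                             _mm_loadu_pd (b + i)));
        acc1 = _mm_add_pd (acc1, _mm_mul_pd (_mm_loadu_pd (a + i + 2),
                                             _mm_loadu_pd (b + i + 2)));
    }
    // Fold the two vector accumulators down to a single scalar.
    __m128d acc = _mm_add_pd (acc0, acc1);
    double lanes [2];
    _mm_storeu_pd (lanes, acc);
    double dot = lanes [0] + lanes [1];
    // Scalar tail for the last size % 4 elements.
    for (; i < size; ++i)
        dot += a [i] * b [i];
    return dot;
}
```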

## Benchmark!

There is only one result that matters: the compiler’s C version runs in 1.25 sec worst case, 1.18 sec best; my version in 1.0 sec worst, 0.9 sec best. Read: I’m up to 35% faster! Tested on Vista x86 on a Turion @ 1.8 GHz with 2000 iterations over a 65536-element vector.
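The original timing harness isn’t shown; a modern re-creation of the setup described above (2000 iterations over a 65536-element vector, with a volatile sink so the optimizer can’t hoist or fold the calls) might look like this:

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

double dot_c (std::size_t size, const double* a, const double* b)
{
    double dot = 0;
    for (std::size_t i = 0; i < size; ++i)
        dot += a [i] * b [i];
    return dot;
}

// Times `dot` the way the text describes: 2000 iterations over a
// 65536-element vector. Returns elapsed wall-clock seconds.
double time_dot (double (*dot)(std::size_t, const double*, const double*))
{
    std::vector<double> a (65536, 1.0), b (65536, 2.0);
    volatile double sink = 0; // keeps the calls from being optimized away
    auto t0 = std::chrono::steady_clock::now ();
    for (int it = 0; it < 2000; ++it)
        sink = dot (a.size (), a.data (), b.data ());
    auto t1 = std::chrono::steady_clock::now ();
    (void) sink;
    return std::chrono::duration<double> (t1 - t0).count ();
}
```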

## Conclusion

While it is usually fun and enjoyable to do some low-level optimization, and it can even give some nice results, it’s usually not worth the hassle. In 99.9% of real-world cases the compiler will create much better code than you can ever hope to write. Unless a profiler tells you that a particular function is a huge bottleneck, don’t even think about this kind of low-level tuning; and even then, think about better algorithms first.