# C++ Tricks, #3: Beating your compiler at stupid stuff

Sometimes you hear people say: "Compilers are all crap. Really
pathetic what they do when you want them to do XYZ." Let's take a look
at how much truth is in that. Please note that this is just a fun
example; don't try this at home, kids. It took me around *one* hour to
squeeze 20% more performance out of this function. This is absolutely *not*
worth it; invest that hour into a better algorithm and it'll give you
much larger savings.

### Rage against the machine

Let's take this function, the dot product.

```cpp
double dot_c (const size_t size, double* __restrict a, double* __restrict b) {
    double dot = 0;
    for (size_t i = 0; i < size; ++i) {
        dot += (a [i] * b [i]);
    }
    return dot;
}
```

This function being pretty stupid allows the compiler and the CPU to
leverage all their hat-tricks. The memory access pattern is so simple
even the dumbest prefetcher should understand it, and there is no `if`
besides the loop condition. It also turns out that VC8 is even able to
evaluate that function at compile time ;) (damn, that was a hard time to
beat, I can tell ya).

### Handmade ...

I'll skip over my first hand-made version, which was 170% slower than the C version above, and go directly to this hand-tuned masterpiece ( :) ). Look how simple and elegant it is ...

```cpp
double dot_sse2 (const size_t size, double* __restrict a, double* __restrict b) {
    __m128d acc = _mm_setzero_pd (), res = _mm_setzero_pd ();
    __m128d a1l = _mm_setzero_pd (), b1l = _mm_setzero_pd (),
            a1h = _mm_setzero_pd (), b1h = _mm_setzero_pd (),
            r1  = _mm_setzero_pd (), r2  = _mm_setzero_pd ();

    for (size_t i = 0, eights = size / 8; i < eights; ++i) {
        a1l = _mm_loadl_pd (a1l, a+8*i),   b1l = _mm_loadl_pd (b1l, b+8*i);
        a1h = _mm_loadl_pd (a1h, a+8*i+1), b1h = _mm_loadl_pd (b1h, b+8*i+1);
        r1  = _mm_mul_sd (a1l, b1l);
        r2  = _mm_mul_sd (a1h, b1h);
        acc = _mm_add_sd (r1, r2);
        res = _mm_add_sd (res, acc);

        a1l = _mm_loadl_pd (a1l, a+8*i+2), b1l = _mm_loadl_pd (b1l, b+8*i+2);
        a1h = _mm_loadl_pd (a1h, a+8*i+3), b1h = _mm_loadl_pd (b1h, b+8*i+3);
        r1  = _mm_mul_sd (a1l, b1l);
        r2  = _mm_mul_sd (a1h, b1h);

        a1l = _mm_loadl_pd (a1l, a+8*i+4), b1l = _mm_loadl_pd (b1l, b+8*i+4);
        a1h = _mm_loadl_pd (a1h, a+8*i+5), b1h = _mm_loadl_pd (b1h, b+8*i+5);
        acc = _mm_add_sd (r1, r2);
        r1  = _mm_mul_sd (a1l, b1l);
        r2  = _mm_mul_sd (a1h, b1h);
        res = _mm_add_sd (res, acc);

        a1l = _mm_loadl_pd (a1l, a+8*i+6), b1l = _mm_loadl_pd (b1l, b+8*i+6);
        a1h = _mm_loadl_pd (a1h, a+8*i+7), b1h = _mm_loadl_pd (b1h, b+8*i+7);
        acc = _mm_add_sd (r1, r2);
        r1  = _mm_mul_sd (a1l, b1l);
        r2  = _mm_mul_sd (a1h, b1h);
        res = _mm_add_sd (res, acc);
        acc = _mm_add_sd (r1, r2);
        res = _mm_add_sd (res, acc);
    }

    // handle the remaining size % 8 elements; note size-1-i, since
    // size-i would read one element past the end of the arrays
    for (size_t i = 0, r = size % 8; i < r; ++i) {
        res.m128d_f64 [0] += a [size-1-i] * b [size-1-i];
    }
    return res.m128d_f64 [0];
}
```

Basically, the trick is to mix computation with memory accesses and to shuffle the instructions around enough that the dependencies can be satisfied while the loads are in flight. It makes use of all 8 XMM registers.
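Note that the version above still uses the scalar `_mm_mul_sd`/`_mm_add_sd` instructions, so it wins by scheduling, not by vectorizing. Just for illustration, here is a truly packed SSE2 sketch (my own, not part of the benchmark below) that processes two doubles per instruction:

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstddef>

// Packed SSE2 dot product: two doubles per multiply and per add.
// A hypothetical sketch for comparison, not the benchmarked version.
double dot_sse2_packed (const size_t size, const double* a, const double* b) {
    __m128d acc = _mm_setzero_pd ();
    const size_t pairs = size / 2;
    for (size_t i = 0; i < pairs; ++i) {
        // unaligned loads keep the sketch general; aligned buffers
        // could use _mm_load_pd instead
        const __m128d va = _mm_loadu_pd (a + 2*i);
        const __m128d vb = _mm_loadu_pd (b + 2*i);
        acc = _mm_add_pd (acc, _mm_mul_pd (va, vb));
    }
    // horizontal sum of the two lanes (SSE2 has no haddpd yet)
    const double lo = _mm_cvtsd_f64 (acc);
    const double hi = _mm_cvtsd_f64 (_mm_unpackhi_pd (acc, acc));
    double dot = lo + hi;
    if (size % 2) {  // one odd element left over
        dot += a [size-1] * b [size-1];
    }
    return dot;
}
```

Whether this actually beats the scheduled scalar version depends on the chip; on hardware of that era the bottleneck tends to be the loads, not the arithmetic.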

### Benchmark!

There is only one result: the compiler's C version takes 1.25 sec worst, 1.18 sec best; my version takes 1.0 sec worst, 0.9 sec best. Read: I'm up to 35% faster! Tested on Vista x86 on a 1.8 GHz Turion with 2000 iterations over a 65536-element vector.
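The harness was along these lines (a sketch using `std::chrono` rather than the timer I actually used back then; the function names here are mine):

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

double dot_c (const size_t size, const double* a, const double* b) {
    double dot = 0;
    for (size_t i = 0; i < size; ++i) dot += a [i] * b [i];
    return dot;
}

// Times `iterations` dot products over two N-element vectors and
// returns the elapsed wall-clock seconds.
double time_dot (const size_t N, const int iterations) {
    std::vector<double> a (N, 1.0), b (N, 2.0);
    volatile double sink = 0;  // keeps the optimizer from deleting the loop
    const auto start = std::chrono::steady_clock::now ();
    for (int it = 0; it < iterations; ++it)
        sink += dot_c (N, a.data (), b.data ());
    const auto stop = std::chrono::steady_clock::now ();
    (void) sink;
    return std::chrono::duration<double> (stop - start).count ();
}
```

Calling `time_dot (65536, 2000)` matches the setup described above; swap in the SSE2 version to compare.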

### Conclusion

While it is usually fun and enjoyable to do some low-level optimization, and it can even give some nice results, it's usually not worth the hassle. In 99.9% of real-world cases the compiler will create much better code than you can ever hope to write. Unless a profiler tells you that a particular function is a huge bottleneck, don't even think about doing such low-level tuning, and even then, think about better algorithms before doing this kind of stuff.