C++ Tricks, #3: Beating your compiler at stupid stuff
Sometimes you hear people say: “The compilers are all crap. Really pathetic what they do when you want them to do XYZ.” Let’s take a look at how much truth is in that. Please note that this is just a fun example — don’t try this at home, kids. It took me around one hour to squeeze 20% more performance out of this function. That is absolutely not worth it; invest the hour in a better algorithm and it’ll give you much larger savings.
Rage against the machine
Let’s take this function, the dot product.
double dot_c (const size_t size,
              double* __restrict a,
              double* __restrict b)
{
    double dot = 0;
    for (size_t i = 0; i < size; ++i)
    {
        dot += a [i] * b [i];
    }
    return dot;
}
This function being so stupidly simple allows the compiler and the CPU to pull out their whole bag of tricks. The memory access pattern is so regular that even the dumbest prefetcher should understand it, and there is no branch besides the loop condition. It turns out that VC8 is even able to evaluate the whole function at compile time ;) (damn, that was hard to beat, I can tell ya).
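That compile-time evaluation also makes naive benchmarking useless, by the way. A minimal sketch of one way around it (this is just an illustration, not the exact harness behind the numbers below): derive the input data from something only known at runtime, such as argc, so the optimizer cannot fold the whole call into a constant.
#include <cstddef>
#include <cstdio>

double dot_c (const size_t size, double* __restrict a, double* __restrict b);

int main (int argc, char**)
{
    const size_t size = 65536;
    double* a = new double [size];
    double* b = new double [size];
    // argc is unknown at compile time, so the optimizer cannot
    // precompute the dot product the way it can with constant input.
    for (size_t i = 0; i < size; ++i)
    {
        a [i] = double (i + argc);
        b [i] = 1.0 / double (i + argc);
    }
    printf ("%f\n", dot_c (size, a, b));
    delete [] a;
    delete [] b;
    return 0;
}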
Handmade …
I’ll skip over my first hand-made version, which was 170% slower than the C version above, and go directly to this hand-tuned masterpiece ( :) ). Look how simple and elegant it is …
double dot_sse2 (const size_t size,
                 double* __restrict a,
                 double* __restrict b)
{
    __m128d acc = _mm_setzero_pd (), res = _mm_setzero_pd ();
    __m128d a1l = _mm_setzero_pd (),
            b1l = _mm_setzero_pd (),
            a1h = _mm_setzero_pd (),
            b1h = _mm_setzero_pd (),
            r1  = _mm_setzero_pd (),
            r2  = _mm_setzero_pd ();
    for (size_t i = 0, eights = size / 8; i < eights; ++i)
    {
        // Loads, multiplies and adds are interleaved so the arithmetic
        // can retire while the next elements are still on their way in.
        a1l = _mm_loadl_pd (a1l, a+8*i),   b1l = _mm_loadl_pd (b1l, b+8*i);
        a1h = _mm_loadl_pd (a1h, a+8*i+1), b1h = _mm_loadl_pd (b1h, b+8*i+1);
        r1  = _mm_mul_sd (a1l, b1l);
        r2  = _mm_mul_sd (a1h, b1h);
        acc = _mm_add_sd (r1, r2);
        res = _mm_add_sd (res, acc);
        a1l = _mm_loadl_pd (a1l, a+8*i+2), b1l = _mm_loadl_pd (b1l, b+8*i+2);
        a1h = _mm_loadl_pd (a1h, a+8*i+3), b1h = _mm_loadl_pd (b1h, b+8*i+3);
        r1  = _mm_mul_sd (a1l, b1l);
        r2  = _mm_mul_sd (a1h, b1h);
        a1l = _mm_loadl_pd (a1l, a+8*i+4), b1l = _mm_loadl_pd (b1l, b+8*i+4);
        a1h = _mm_loadl_pd (a1h, a+8*i+5), b1h = _mm_loadl_pd (b1h, b+8*i+5);
        acc = _mm_add_sd (r1, r2);
        r1  = _mm_mul_sd (a1l, b1l);
        r2  = _mm_mul_sd (a1h, b1h);
        res = _mm_add_sd (res, acc);
        a1l = _mm_loadl_pd (a1l, a+8*i+6), b1l = _mm_loadl_pd (b1l, b+8*i+6);
        a1h = _mm_loadl_pd (a1h, a+8*i+7), b1h = _mm_loadl_pd (b1h, b+8*i+7);
        acc = _mm_add_sd (r1, r2);
        r1  = _mm_mul_sd (a1l, b1l);
        r2  = _mm_mul_sd (a1h, b1h);
        res = _mm_add_sd (res, acc);
        acc = _mm_add_sd (r1, r2);
        res = _mm_add_sd (res, acc);
    }
    // Handle the up to 7 leftover elements at the tail of the vectors.
    for (size_t i = 0, r = size % 8; i < r; ++i)
    {
        res.m128d_f64 [0] += a [size-1-i] * b [size-1-i];
    }
    return res.m128d_f64 [0];
}
Basically, the trick is to mix computation with memory access and to shuffle the instructions around enough that the dependencies can be satisfied while the loads are still in flight. It also makes use of all 8 XMM registers.
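For comparison — and this is not the version benchmarked below — the textbook packed approach would use _mm_mul_pd/_mm_add_pd to handle two doubles per instruction. A sketch, assuming nothing about alignment (hence the unaligned loads):
#include <emmintrin.h>

double dot_sse2_packed (const size_t size,
                        double* __restrict a,
                        double* __restrict b)
{
    __m128d acc = _mm_setzero_pd ();
    for (size_t i = 0, pairs = size / 2; i < pairs; ++i)
    {
        const __m128d va = _mm_loadu_pd (a + 2*i);
        const __m128d vb = _mm_loadu_pd (b + 2*i);
        acc = _mm_add_pd (acc, _mm_mul_pd (va, vb));
    }
    // Horizontal sum of the two lanes, plus the leftover element if size is odd.
    double lanes [2];
    _mm_storeu_pd (lanes, acc);
    double dot = lanes [0] + lanes [1];
    if (size % 2)
        dot += a [size-1] * b [size-1];
    return dot;
}
Note the single accumulator: every _mm_add_pd depends on the result of the previous one, so this loop is bounded by the latency of the add. Hiding exactly that kind of dependency chain behind the memory accesses is what all the shuffling in the unrolled version above buys.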
Benchmark!
There is only one result that matters: the compiler’s C version takes 1.25 sec in the worst run and 1.18 sec in the best; my version takes 1.0 sec worst and 0.9 sec best. Read: comparing best runs, I’m roughly 30% faster! Tested on Vista x86 on a 1.8 GHz Turion with 2000 iterations over a 65536-element vector.
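The measurement loop itself can be as simple as the following sketch (illustrative, not the exact code used for the numbers above). The volatile sink exists so the optimizer cannot throw the 2000 calls away:
#include <windows.h>
#include <cstddef>

double dot_c (const size_t size, double* __restrict a, double* __restrict b);

double benchmark (const size_t size, double* a, double* b)
{
    LARGE_INTEGER freq, start, stop;
    QueryPerformanceFrequency (&freq);
    volatile double sink = 0;  // keeps the calls from being optimized away
    QueryPerformanceCounter (&start);
    for (int run = 0; run < 2000; ++run)
        sink += dot_c (size, a, b);
    QueryPerformanceCounter (&stop);
    // Elapsed wall-clock time in seconds.
    return double (stop.QuadPart - start.QuadPart) / double (freq.QuadPart);
}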
Conclusion
While it is usually fun and enjoyable to do some low-level optimization, and it can even give some nice results, it’s rarely worth the hassle. In 99.9% of real-world cases the compiler will create much better code than you could ever hope to write. Unless a profiler tells you that a particular function is a huge bottleneck, don’t even think about this kind of low-level tuning — and even then, think about a better algorithm first.