Compiler "optimisation" ...
It's 2007, so the days of handwritten assembler are a thing of the past, right? Well, not really, not even if you are using (otherwise excellent) compilers like VC++ 8.0 or Intel C++ 9.1.
Vector code, anyone?
In 1999, Intel introduced the SSE instruction set. Eight years later, the compiler writers still do not seem to be aware of it! The reference code is:
__declspec(dllexport) void mul4 (__m128& left, const __m128& right)
{
left = _mm_mul_ps (left, right);
}
which transforms into:
mov ecx, DWORD PTR _right$[ebp]
movaps xmm0, XMMWORD PTR [eax]
movaps xmm1, XMMWORD PTR [ecx]
mulps xmm0, xmm1
movaps XMMWORD PTR [eax], xmm0
Note that it’s using movaps because __m128 is properly aligned by default. In the following examples, I’ve been passing a float* instead, so a movups instruction would be needed to load the four floats at once. Let’s see whether we can write the code in such a way that the compiler automagically vectorises it, or at least emits a mulps instruction. The following tests were done with Visual C++ 8.0 SP1. First try, most straightforward:
left[0] *= right [0];
left[1] *= right [1];
left[2] *= right [2];
left[3] *= right [3];
Ok, slightly modified:
left [0] = left [0] * right [0];
left [1] = left [1] * right [1];
left [2] = left [2] * right [2];
left [3] = left [3] * right [3];
Maybe with loop unrolling?
for (int i = 0; i < 4; ++i) {
left [i] = left [i] * right [i];
}
The compiler unrolled the loop, but that’s all. Next try, the same loop written with *= …
for (int i = 0; i < 4; ++i) {
left [i] *= right [i];
}
Argh! Maybe with a temporary?
float r[4];
r [0] = left [0] * right [0];
r [1] = left [1] * right [1];
r [2] = left [2] * right [2];
r [3] = left [3] * right [3];
std::copy (r, r+4, left);
return left;
Hmm, what about using the __m128 data type? This data type is aligned by default, so maybe the compiler heuristics would see that a movaps would be sufficient in this case.
left.m128_f32 [0] *= right.m128_f32 [0];
left.m128_f32 [1] *= right.m128_f32 [1];
left.m128_f32 [2] *= right.m128_f32 [2];
left.m128_f32 [3] *= right.m128_f32 [3];
Unfortunately, this didn’t help either. I also experimented a bit with restrict on the inputs, without results. None of the attempts led to optimal code: you always wind up with four mulss instructions instead of the single mulps instruction that would do the job. Argh! At least, knowing ASM still separates the men from the boys :)
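For comparison, here is a minimal sketch of how the float* case could be written with intrinsics, so that the packed multiply is requested explicitly instead of hoping the compiler finds it. The function name mul4_unaligned is made up for this illustration and is not part of the tests above. _mm_loadu_ps and _mm_storeu_ps are the unaligned load/store intrinsics, which normally map to movups, but the exact instructions emitted still depend on the compiler and its settings:

```cpp
#include <xmmintrin.h>  // SSE intrinsics: _mm_loadu_ps, _mm_mul_ps, _mm_storeu_ps

// Hypothetical helper (not from the original tests): multiplies four floats
// element-wise, left[i] *= right[i], without assuming 16-byte alignment.
inline void mul4_unaligned(float* left, const float* right)
{
    const __m128 l = _mm_loadu_ps(left);    // unaligned load of left[0..3]
    const __m128 r = _mm_loadu_ps(right);   // unaligned load of right[0..3]
    _mm_storeu_ps(left, _mm_mul_ps(l, r));  // packed multiply, unaligned store
}
```

If the caller can guarantee 16-byte alignment, _mm_load_ps and _mm_store_ps could be used instead, which corresponds to the movaps variant shown for __m128.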