Compiler "optimisation" ...

January 15, 2007


It’s 2007, so the days of handwritten assembler are a thing of the past, right? Well, not really, at least not if you are still using (otherwise excellent) compilers like VC++ 8.0 or even Intel C++ 9.1.

Vector code, anyone?

In 1999, Intel introduced the SSE instruction set with the Pentium III. Eight years later, the compilers still don’t seem to be aware of it! The reference code, written with intrinsics, is:

__declspec(dllexport) void mul4 (__m128& left, const __m128& right)
{
    left = _mm_mul_ps (left, right);
}

which transforms into:

mov    ecx, DWORD PTR _right$[ebp]
movaps  xmm0, XMMWORD PTR [eax]
movaps  xmm1, XMMWORD PTR [ecx]
mulps   xmm0, xmm1
movaps  XMMWORD PTR [eax], xmm0

Note that it uses movaps because __m128 is properly aligned by default. In the following examples, I pass a float* instead, so a movups instruction would be needed to load the four floats at once. Let’s see whether we can write the code in such a way that the compiler automagically vectorises it, or at least emits the mulps instruction. The following tests were done with Visual C++ 8.0 SP1.

First try, most straightforward:

left[0] *= right [0];
left[1] *= right [1];
left[2] *= right [2];
left[3] *= right [3];

Ok, slightly modified:

left [0] = left [0] * right [0];
left [1] = left [1] * right [1];
left [2] = left [2] * right [2];
left [3] = left [3] * right [3];

Maybe with loop unrolling?

for (int i = 0; i < 4; ++i) {
    left [i] = left [i] * right [i];
}

The compiler unrolled the loop, but that’s all. Next try, the same loop with the compound assignment …

for (int i = 0; i < 4; ++i) {
    left [i] *= right [i];
}

Argh! Maybe with a temporary?

float r[4];
r [0] = left [0] * right [0];
r [1] = left [1] * right [1];
r [2] = left [2] * right [2];
r [3] = left [3] * right [3];
std::copy (r, r+4, left);
return left;

Hmm, using the __m128 data type? This data type is aligned by default, so maybe the compiler heuristics would see that a movaps would be sufficient in this case.

left.m128_f32 [0] *= right.m128_f32 [0];
left.m128_f32 [1] *= right.m128_f32 [1];
left.m128_f32 [2] *= right.m128_f32 [2];
left.m128_f32 [3] *= right.m128_f32 [3];

Unfortunately, this didn’t help either. I also experimented a bit with restrict on the inputs, without results. None of the attempts led to optimal code: you always wind up with four mulss instructions instead of a single mulps, which would do the job. Argh! At least, knowing ASM still separates the men from the boys :)
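
For reference, here is a minimal sketch of the two ends of the experiment for the float* case. The function names mul4_scalar and mul4_unaligned are made up for illustration and do not come from the tests above. The first is only a guess at what a restrict-qualified scalar attempt might look like, using the __restrict extension. The second is the hand-written intrinsic version, using the unaligned load and store intrinsics, which typically map to movups, around a single _mm_mul_ps.

```cpp
#include <xmmintrin.h>  // SSE intrinsics: __m128, _mm_loadu_ps, _mm_mul_ps, _mm_storeu_ps

// Hypothetical name. Scalar attempt with no-alias hints; __restrict is a
// compiler extension (accepted by MSVC, GCC and Clang), not standard C++.
void mul4_scalar(float* __restrict left, const float* __restrict right)
{
    for (int i = 0; i < 4; ++i) {
        left[i] *= right[i];
    }
}

// Hypothetical name. Hand-written version for pointers that carry no
// 16-byte alignment guarantee.
void mul4_unaligned(float* left, const float* right)
{
    const __m128 l = _mm_loadu_ps(left);    // unaligned load of left[0..3]
    const __m128 r = _mm_loadu_ps(right);   // unaligned load of right[0..3]
    _mm_storeu_ps(left, _mm_mul_ps(l, r));  // one packed multiply, unaligned store
}
```

Both functions compute the same four products; the only thing that differs is whether the packed multiply is spelled out by hand or left for the compiler to discover. If the caller can guarantee 16-byte alignment, _mm_load_ps and _mm_store_ps can be used instead of the unaligned variants.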