Anteru's blog

Compiler "optimisation" ...

January 15, 2007
  • Optimisation
  • Programming

It's 2007, so the days of handwritten assembler are a thing of the past, right? Well, not really, at least not if you are using (otherwise excellent) compilers like VC++ 8.0 or even Intel C++ 9.1.

Vector code, anyone?

In 1999, Intel introduced the SSE instruction set. Eight years later, the compilers still don't seem to be aware of it! The reference code is:

__declspec(dllexport) void mul4 (__m128& left, const __m128& right)
{
    left = _mm_mul_ps (left, right);
}

which transforms into:

mov    ecx, DWORD PTR _right$[ebp]
movaps  xmm0, XMMWORD PTR [eax]
movaps  xmm1, XMMWORD PTR [ecx]
mulps   xmm0, xmm1
movaps  XMMWORD PTR [eax], xmm0

Note that the compiler uses movaps here because __m128 is properly aligned by default. In the following examples, I'm passing a float* instead, so a movups instruction would be needed to load the four floats at once. Can we write the code in such a way that the compiler automagically vectorises it, or at least emits a mulps instruction?

The following tests were done with Visual C++ 8.0 SP1. First try, most straightforward:

left[0] *= right [0];
left[1] *= right [1];
left[2] *= right [2];
left[3] *= right [3];

Ok, slightly modified:

left [0] = left [0] * right [0];
left [1] = left [1] * right [1];
left [2] = left [2] * right [2];
left [3] = left [3] * right [3];

Maybe with loop unrolling?

for (int i = 0; i < 4; ++i) {
    left [i] = left [i] * right [i];
}

The compiler unrolled the loop, but that's all. Next try, the same loop with *= instead …

for (int i = 0; i < 4; ++i) {
    left [i] *= right [i];
}

Argh! Maybe with a temporary?

float r[4];
r [0] = left [0] * right [0];
r [1] = left [1] * right [1];
r [2] = left [2] * right [2];
r [3] = left [3] * right [3];
std::copy (r, r+4, left);
return left;
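One plausible reason a compiler hesitates to merge the four multiplications is that left and right might overlap in memory. The sketch below is not from the original tests, and the function names mul4_in_place and mul4_via_temporary are made up for illustration. It shows that the in-place variant and the temporary variant really do compute different things when the pointers alias. A single mulps loads all four inputs before anything is stored, so it behaves like the temporary variant.

```cpp
#include <algorithm>  // std::copy

// Hypothetical name: multiplies element by element and stores each
// result before the next pair is loaded.
void mul4_in_place (float* left, const float* right)
{
    for (int i = 0; i < 4; ++i) {
        left [i] *= right [i];
    }
}

// Hypothetical name: computes all four products first, then writes them
// back, so no store can influence a later load.
void mul4_via_temporary (float* left, const float* right)
{
    float r [4];
    for (int i = 0; i < 4; ++i) {
        r [i] = left [i] * right [i];
    }
    std::copy (r, r + 4, left);
}
```

With a five-element buffer {1, 2, 3, 4, 5}, left = buffer + 1 and right = buffer, the in-place variant yields {1, 2, 6, 24, 120}, while the temporary variant yields {1, 2, 6, 12, 20}. For non-overlapping arrays both give the same result.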

Hmm, using the __m128 data type? This data type is aligned by default, so maybe the compiler heuristics would see that a movaps would be sufficient in this case.

left.m128_f32 [0] *= right.m128_f32 [0];
left.m128_f32 [1] *= right.m128_f32 [1];
left.m128_f32 [2] *= right.m128_f32 [2];
left.m128_f32 [3] *= right.m128_f32 [3];

Unfortunately, this didn't help either. I also experimented a bit with restrict on the inputs, without results. None of the attempts led to optimal code: you always wind up with four mulss instructions instead of the single mulps that would do the job. Argh! At least knowing ASM still separates the men from the boys :)
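For the float* case, the single mulps does not strictly require inline assembler; it can be requested by hand through the SSE intrinsics. The following is a minimal sketch, assuming an x86 target with SSE enabled. The name mul4_unaligned is made up for illustration and is not part of the tests above. Because a plain float* carries no 16-byte alignment guarantee, it uses the unaligned load and store intrinsics, which typically compile to movups.

```cpp
#include <xmmintrin.h>  // SSE: _mm_loadu_ps, _mm_mul_ps, _mm_storeu_ps

// Hypothetical float* counterpart of the mul4 reference function.
void mul4_unaligned (float* left, const float* right)
{
    // Unaligned loads: no alignment assumption on either pointer.
    const __m128 l = _mm_loadu_ps (left);
    const __m128 r = _mm_loadu_ps (right);

    // One packed multiply for all four floats, then an unaligned store.
    _mm_storeu_ps (left, _mm_mul_ps (l, r));
}
```

If both pointers are known to be 16-byte aligned, _mm_load_ps and _mm_store_ps could be used instead, which corresponds to the movaps seen in the reference listing.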

Contents © 2005-2025
Anteru
Last updated February 20, 2019