
C++ Tricks, #3: Beating your compiler at stupid stuff

May 02, 2007
  • Optimisation
  • Programming
approximately 13 minutes to read

Sometimes you hear people say: “The compilers are all crap. Really pathetic what they do when you want them to do XYZ.” Let’s take a look at how much truth there is in that. Please note that this is just a fun example; don’t try this at home, kids. It took me around one hour to squeeze 20% more performance out of this function, which is absolutely not worth it. Invest that hour into a better algorithm and it will give you much larger savings.

Rage against the machine

Let’s take this function, the dot product.

#include <cstddef>

double dot_c (const size_t size,
    double* __restrict a,
    double* __restrict b)
{
    double dot = 0;
    for (size_t i = 0; i < size; ++i)
    {
        dot += (a [i] * b [i]);
    }

    return dot;
}

Being this simple allows the compiler and the CPU to pull out all their tricks. The memory access pattern is so simple that even the dumbest prefetcher should understand it, and there is no branch besides the loop condition. It turns out that VC8 is even able to evaluate this function at compile time ;) (damn, that was a hard act to beat, I can tell you).
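That compile-time evaluation is exactly what makes naive micro-benchmarks lie, by the way. Here is a minimal harness sketch (my own reconstruction, not the measurement code from this post): because the inputs depend on argc, the compiler cannot fold the dot product into a constant, and printing the result keeps the call from being removed as dead code.

#include <cstddef>
#include <cstdio>
#include <vector>

// Assumes dot_c from above is in scope.
int main (int argc, char* argv [])
{
    const size_t n = 65536;
    std::vector<double> a (n), b (n);

    for (size_t i = 0; i < n; ++i) {
        // argc is only known at run time, so the inputs are, too
        a [i] = static_cast<double> (i % 16) + argc;
        b [i] = 1.0 / (1.0 + i % 16);
    }

    const double result = dot_c (n, &a [0], &b [0]);
    std::printf ("%f\n", result); // using the result keeps the call alive

    return 0;
}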

Handmade …

I’ll skip over my first hand-made version, which was 170% slower than the C version above, and go directly to this hand-tuned masterpiece ( :) ). Look how simple and elegant it is …

#include <emmintrin.h>

double dot_sse2 (const size_t size,
    double* __restrict a,
    double* __restrict b)
{
    __m128d acc = _mm_setzero_pd (), res = _mm_setzero_pd ();
    __m128d a1l = _mm_setzero_pd (),
        b1l = _mm_setzero_pd (),
        a1h = _mm_setzero_pd (),
        b1h = _mm_setzero_pd (),
        r1 = _mm_setzero_pd (),
        r2 = _mm_setzero_pd ();

    for (size_t i = 0, eights = size / 8; i < eights; ++i) {
        a1l = _mm_loadl_pd (a1l, a+8*i), b1l = _mm_loadl_pd (b1l, b+8*i);
        a1h = _mm_loadl_pd (a1h, a+8*i+1), b1h = _mm_loadl_pd (b1h, b+8*i+1);

        r1 = _mm_mul_sd (a1l, b1l);
        r2 = _mm_mul_sd (a1h, b1h);

        acc = _mm_add_sd (r1, r2);
        res = _mm_add_sd (res, acc);

        a1l = _mm_loadl_pd (a1l, a+8*i+2), b1l = _mm_loadl_pd (b1l, b+8*i+2);
        a1h = _mm_loadl_pd (a1h, a+8*i+3), b1h = _mm_loadl_pd (b1h, b+8*i+3);

        r1 = _mm_mul_sd (a1l, b1l);
        r2 = _mm_mul_sd (a1h, b1h);

        a1l = _mm_loadl_pd (a1l, a+8*i+4), b1l = _mm_loadl_pd (b1l, b+8*i+4);
        a1h = _mm_loadl_pd (a1h, a+8*i+5), b1h = _mm_loadl_pd (b1h, b+8*i+5);

        acc = _mm_add_sd (r1, r2);
        r1 = _mm_mul_sd (a1l, b1l);
        r2 = _mm_mul_sd (a1h, b1h);
        res = _mm_add_sd (res, acc);

        a1l = _mm_loadl_pd (a1l, a+8*i+6), b1l = _mm_loadl_pd (b1l, b+8*i+6);
        a1h = _mm_loadl_pd (a1h, a+8*i+7), b1h = _mm_loadl_pd (b1h, b+8*i+7);

        acc = _mm_add_sd (r1, r2);
        r1 = _mm_mul_sd (a1l, b1l);
        r2 = _mm_mul_sd (a1h, b1h);
        res = _mm_add_sd (res, acc);

        acc = _mm_add_sd (r1, r2);
        res = _mm_add_sd (res, acc);
    }

    // Handle the size % 8 leftover elements in plain C
    double dot = _mm_cvtsd_f64 (res);
    for (size_t i = size - (size % 8); i < size; ++i) {
        dot += a [i] * b [i];
    }

    return dot;
}

Basically, the trick is to mix computation with memory accesses and to shuffle the instructions around enough that the dependencies can be satisfied while the loads are still in progress. It makes use of all 8 XMM registers.
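For comparison, the straightforward vectorisation would use the packed instructions instead of the scalar ones. A minimal sketch, assuming 16-byte aligned inputs; this is not the version I benchmarked:

#include <cstddef>
#include <emmintrin.h>

// Packed alternative: two doubles per instruction, no interleaving tricks.
double dot_sse2_packed (const size_t size,
    double* __restrict a,
    double* __restrict b)
{
    __m128d res = _mm_setzero_pd ();

    for (size_t i = 0; i + 1 < size; i += 2) {
        res = _mm_add_pd (res, _mm_mul_pd (
            _mm_load_pd (a + i), _mm_load_pd (b + i)));
    }

    // Horizontal sum of the two lanes
    double dot = _mm_cvtsd_f64 (
        _mm_add_sd (res, _mm_unpackhi_pd (res, res)));

    if (size % 2) { // one leftover element if size is odd
        dot += a [size-1] * b [size-1];
    }

    return dot;
}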

Benchmark!

There is only one result that matters: the compiler’s C version takes 1.25 sec at worst and 1.18 sec at best, while my version takes 1.0 sec at worst and 0.9 sec at best. Read: I’m up to 35% faster! Tested on Vista x86 on a 1.8 GHz Turion, with 2000 iterations over a 65536-element vector.
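The measurement loop looked roughly like this (a reconstruction; the exact harness is not part of this post). Summing the results into a sink variable prevents the calls from being optimised away:

#include <cstddef>
#include <cstdio>
#include <ctime>

// Times 2000 calls of a dot-product implementation; a and b are the
// 65536-element vectors set up elsewhere.
double benchmark (double (*dot) (size_t, double*, double*),
    const size_t size, double* a, double* b)
{
    double sink = 0; // keep the results live
    const std::clock_t start = std::clock ();

    for (int iteration = 0; iteration < 2000; ++iteration) {
        sink += dot (size, a, b);
    }

    const double seconds =
        static_cast<double> (std::clock () - start) / CLOCKS_PER_SEC;
    std::printf ("%f sec (sink: %f)\n", seconds, sink);
    return seconds;
}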

Conclusion

While it is usually fun and enjoyable to do some low-level optimisation, and it can even give some nice results, it’s usually not worth the hassle. In 99.9% of real-world cases the compiler will create much better code than you can ever hope to write. Unless a profiler tells you that a particular function is a huge bottleneck, don’t even think about this kind of low-level tuning, and even then, think about a better algorithm first.
