Anteru's blog

Post schedule, SSE, other stuff

December 19, 2008
  • Optimisation
  • Programming
approximately 3 minutes to read

Well, I know, I haven’t been posting for some weeks now, and I’m starting to get complaints about it, so here we go :)

Posting schedule

As you can easily see, my posting schedule is no longer really regular. For some time I tried to keep up at least one post per week; the problem is, I don’t want to post about work-in-progress stuff. So as long as I’m working on something, I usually don’t write about it – only after I finish. Other kinds of posts I do on a case-by-case basis, usually based on reader requests. So if you want to read about a subject, drop me a line by mail or just comment somewhere, and I’ll take a look.

Current work

That said, I’m currently working on some image processing tools. They’re not finished yet, so no further details there. What might be more interesting is that I’ve been using SSE2 for part of this, and so far I’m very pleased with the results. SSE2 is a perfect fit for image processing in particular, although it does not support 16-bit floats. For 8-bit images, however, you can usually process 2 pixels with 4 channels each at once. Why only 8 elements? Because many SSE2 integer operations produce 16-bit intermediate results, so you need to pack and unpack, and with only eight 16-bit slots per register there is no room for more than 2 pixels.
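To illustrate the unpack/process/pack dance, here is a minimal sketch with intrinsics. The function name and the operation (scaling each channel by factor/256, with a factor up to 256) are made up purely for the demonstration:

```cpp
#include <emmintrin.h> // SSE2 intrinsics
#include <cstdint>

// Scale 2 RGBA pixels (8 bytes) by factor/256, where factor <= 256.
// Illustrative only: unpack 8-bit channels to 16-bit lanes, multiply,
// shift, and pack back down with unsigned saturation.
void scale_2px (const uint8_t* src, uint8_t* dst, uint16_t factor)
{
    const __m128i zero = _mm_setzero_si128 ();
    // Load 8 bytes (2 RGBA pixels) into the low half of the register
    __m128i px = _mm_loadl_epi64 (reinterpret_cast<const __m128i*> (src));
    // Widen to eight 16-bit lanes so the multiply cannot overflow
    __m128i wide = _mm_unpacklo_epi8 (px, zero);
    __m128i f = _mm_set1_epi16 (static_cast<short> (factor));
    // (channel * factor) >> 8, i.e. multiply by factor/256
    __m128i scaled = _mm_srli_epi16 (_mm_mullo_epi16 (wide, f), 8);
    // Pack back to 8 bits with unsigned saturation
    __m128i packed = _mm_packus_epi16 (scaled, scaled);
    _mm_storel_epi64 (reinterpret_cast<__m128i*> (dst), packed);
}
```

Note how the whole 128-bit register is occupied by just 2 pixels once the channels are widened to 16 bits – that’s the limit mentioned above.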

Here I also gotta say that I’m a happy user of compiler intrinsics. While some people avoid them like the plague, I’ve observed pretty good code generation so far, even when I intentionally used more registers than available so the compiler had to decide where to spill them. Moreover, as GCC and the Intel C++ compiler understand the same intrinsics, I immediately get portability across x86 and x64 on Windows, Linux and Mac OS X, which is well worth the extra typing.

One word of warning: optimising with SSE2 takes a lot of time. First, you have to write a C baseline version, which must be maintained and tested as a fallback for CPUs that don’t have SSE2. Once you have a correct and verified C version, you can optimise it until you get the feeling the compiler should be able to generate great SSE2 from it. And then you gotta do the SSE2 translation, always checking the compiler output to make sure it didn’t rearrange stuff and such (with VC 2008 SP1 I haven’t had a problem so far – I didn’t try the 2010 CTP yet due to a lack of time). Do it if your profiling tells you to, but don’t do it just for fun.
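As a sketch of what such a baseline looks like, here is a plain C-style version of a hypothetical per-channel operation (scaling by factor/256, invented for illustration). This is the version you keep around as the fallback and verify the SSE2 translation against:

```cpp
#include <cstdint>
#include <cstddef>

// Scalar baseline: scale each 8-bit channel by factor/256 with
// saturation. Serves as the fallback for CPUs without SSE2 and as
// the reference implementation for verifying the vectorised version.
void scale_scalar (const uint8_t* src, uint8_t* dst,
                   size_t count, uint16_t factor)
{
    for (size_t i = 0; i < count; ++i)
    {
        const unsigned v = (static_cast<unsigned> (src[i]) * factor) >> 8;
        dst[i] = (v > 255) ? 255 : static_cast<uint8_t> (v);
    }
}
```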

Compiler optimisations, abstraction penalty

Some more notes on writing fast code. First, don’t assume a compiler will unroll all those loops over 3–4 elements; some don’t, and sometimes this results in slightly slower code. Beware of passing around standard containers: sometimes the compiler cannot construct the container right at the target site, leaving you with a copy – which is more expensive than you might think, as it requires at least one additional allocation, a memcpy (which can be slower than expected), and a deallocation.
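The container issue boils down to how the parameter is declared. A contrived pair of functions (names invented for illustration) shows the difference:

```cpp
#include <vector>
#include <numeric>

// By-value parameter: every call copies the vector - an allocation,
// a memcpy, and later a deallocation, even if the callee only reads it.
long long SumByValue (std::vector<int> data)
{
    return std::accumulate (data.begin (), data.end (), 0LL);
}

// Const reference: the caller's buffer is read in place, no copy.
long long SumByRef (const std::vector<int>& data)
{
    return std::accumulate (data.begin (), data.end (), 0LL);
}
```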

Absolutely avoid indirection. I couldn’t believe it, but even removing one level of indirection can give you 5–10% more speed. In my case, I replaced a call through a virtual function with an explicitly stored function pointer. For the virtual function call, the chain is: look up the object, offset into the vtable, call the target entry. The more direct call is just: jump to the memory address stored here. I was actually pretty surprised, as I had assumed both pointers would sit in the L1 cache and hence the access cost would be minimal.
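Structurally, the two variants look like this (all names are made up for illustration; the speedup observed obviously depends on the workload):

```cpp
// Virtual dispatch: load the object pointer, load its vtable pointer,
// load the vtable slot, then call - three dependent memory accesses.
struct Filter
{
    virtual ~Filter () {}
    virtual int Apply (int x) const = 0;
};

struct Doubler : public Filter
{
    virtual int Apply (int x) const { return x * 2; }
};

// Direct dispatch: the function pointer is stored right in the hot
// structure, so the call is a single load plus an indirect jump.
typedef int (*FilterFunction)(int);

int DoubleValue (int x) { return x * 2; }

struct DirectKernel
{
    FilterFunction function;
};
```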

Blog theme

Different subject: for some time now I’ve been planning to give this blog a total overhaul. If you know a good page with blank templates for WordPress, I’d be happy to hear about it, as I’m looking for a very clean template to start with.
