Well, I know, I haven't posted for some weeks now, and I'm getting complaints about it, so here we go :)
As you can easily see, my posting schedule is no longer really regular. For some time I tried to keep up at least one post per week. The problem is that I don't want to post about work-in-progress stuff, so as long as I'm working on something, I usually don't write about it until I'm finished. Other kinds of posts I do on a case-by-case basis, usually based on user requests. So if you want to read about a subject, drop me a line by mail or just comment somewhere, and I'll take a look.
That said, I'm currently working on some image processing tools. They're not finished yet, so no further details there. What might be more interesting is that I've been using SSE2 for part of this, and so far I'm very pleased with the results. Especially for image processing, SSE2 is a perfect fit, although it does not support 16-bit floats. For 8-bit images, however, you can usually process 2 pixels of 4 channels each at once. Why only 8 elements? Because many SSE2 integer operations produce 16-bit results, you have to unpack to 16 bits and pack back down afterwards, and given that a register has only eight 16-bit slots, you can't hope to fit more than 2 pixels.
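To make the unpack/pack dance a bit more concrete, here is a minimal sketch that scales two RGBA pixels by a fixed-point factor. The function names and the factor/256 convention are made up for illustration and are not taken from the tools mentioned above. The scalar version is the reference, and the SSE2 version is meant to produce the same bytes.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cassert>
#include <cstdint>

// Scalar reference: scales 8 bytes (2 pixels x 4 channels) by factor / 256.
// Hypothetical helper; factor is expected to be in [0, 256].
inline void scale_two_pixels_scalar(const std::uint8_t* src, std::uint8_t* dst,
                                    std::uint16_t factor)
{
    assert(factor <= 256);
    for (int i = 0; i < 8; ++i)
        dst[i] = static_cast<std::uint8_t>((src[i] * factor) >> 8);
}

// SSE2 version of the same operation.
inline void scale_two_pixels_sse2(const std::uint8_t* src, std::uint8_t* dst,
                                  std::uint16_t factor)
{
    assert(factor <= 256);
    const __m128i zero = _mm_setzero_si128();

    // Load 8 bytes into the low half of the register.
    const __m128i px8 = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(src));

    // Unpack: interleaving with zero widens each byte to a 16-bit slot,
    // which fills all eight 16-bit slots of the register.
    const __m128i px16 = _mm_unpacklo_epi8(px8, zero);

    // 255 * 256 = 65280 still fits into 16 bits, so the low half of the
    // product is the full product.
    const __m128i prod =
        _mm_mullo_epi16(px16, _mm_set1_epi16(static_cast<short>(factor)));
    const __m128i scaled = _mm_srli_epi16(prod, 8);

    // Pack: narrow back to 8 bits (with unsigned saturation) and store 8 bytes.
    const __m128i packed = _mm_packus_epi16(scaled, zero);
    _mm_storel_epi64(reinterpret_cast<__m128i*>(dst), packed);
}
```

Half of the loaded register is wasted during the 16-bit stage, which is exactly why only 2 pixels fit.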
Here I also have to say that I'm a happy user of compiler intrinsics. While some people avoid them like the plague, I have observed pretty good code generation so far, even when I intentionally used more registers than available so the compiler had to decide which ones to spill. Moreover, as GCC and Intel C++ understand all the same intrinsics, I immediately get portability across x86 and x64 on Windows, Linux and Mac OS X, which is well worth the extra typing.
One word of warning: optimising with SSE2 takes a lot of time. First, you have to write a C baseline version, which must be maintained and tested as a fallback for CPUs that don't have SSE2. Once you have a correct and verified C version, you can start optimising it until you get the feeling the compiler should be able to make great SSE2 code out of it. Then you do the SSE2 translation, always checking the compiler output to make sure it didn't rearrange things (with VC 2008 SP1, I haven't had a problem so far; I haven't tried the 2010 CTP yet due to a lack of time). Do it if your profiling tells you to, but don't do it just for fun.
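As a rough illustration of the "baseline first, then verify" workflow, here is a sketch that keeps a scalar version next to the SSE2 one and compares them exhaustively. The operation (a rounded average of two 8-bit buffers) and all function names are assumptions chosen for the example. With 8-bit inputs you can afford to test every input pair.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Baseline: rounded average of two 8-bit buffers. Also serves as the fallback.
void average_scalar(const std::uint8_t* a, const std::uint8_t* b,
                    std::uint8_t* out, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        out[i] = static_cast<std::uint8_t>((a[i] + b[i] + 1) >> 1);
}

// SSE2: 16 bytes per iteration; the remaining tail goes through the baseline.
void average_sse2(const std::uint8_t* a, const std::uint8_t* b,
                  std::uint8_t* out, std::size_t n)
{
    assert(n == 0 || (a != nullptr && b != nullptr && out != nullptr));
    std::size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        const __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i));
        const __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b + i));
        // _mm_avg_epu8 computes (a + b + 1) >> 1 per byte, same as the baseline.
        _mm_storeu_si128(reinterpret_cast<__m128i*>(out + i), _mm_avg_epu8(va, vb));
    }
    average_scalar(a + i, b + i, out + i, n - i);
}

// Runs both versions over every pair of 8-bit inputs and compares the results.
bool average_versions_agree()
{
    const std::size_t n = 256 * 256 + 5;  // + 5 so the scalar tail is covered too
    std::vector<std::uint8_t> a(n), b(n), expected(n), actual(n);
    for (std::size_t i = 0; i < n; ++i) {
        a[i] = static_cast<std::uint8_t>(i & 0xFF);
        b[i] = static_cast<std::uint8_t>((i >> 8) & 0xFF);
    }
    average_scalar(a.data(), b.data(), expected.data(), n);
    average_sse2(a.data(), b.data(), actual.data(), n);
    return expected == actual;
}
```

A check like this only tells you that the two versions agree. You still have to look at the generated code and at the profiler to know whether the SSE2 version was worth it.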
Compiler optimisations, abstraction penalty
Some more notes on writing fast code. First, don't assume a compiler will unroll all those loops over 3-4 elements. Some don't, and sometimes this results in slightly slower code. Second, beware of passing around standard containers: sometimes the compiler cannot construct the container right at the target site, which gives you a copy. That copy is more expensive than you might think, as it requires at least one additional allocation, a memcpy (which may not be that fast) and a deallocation.
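Here is a small sketch of where such container copies come from and one way to sidestep them. The function names are hypothetical. The first two functions only report whether the callee sees the caller's storage, and the third fills a caller-owned vector instead of returning a new one.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// By value: passing an existing vector copies it, so the callee works on a
// fresh allocation (allocate, copy the elements, free again on return).
bool shares_storage_by_value(std::vector<float> pixels, const float* caller_data)
{
    return pixels.data() == caller_data;
}

// By const reference: no copy, the callee sees the caller's storage.
bool shares_storage_by_reference(const std::vector<float>& pixels,
                                 const float* caller_data)
{
    return pixels.data() == caller_data;
}

// Out parameter: the caller owns the vector, so repeated calls with the same
// size can reuse the existing allocation.
void fill_ramp(std::vector<float>& out, std::size_t count)
{
    out.resize(count);
    for (std::size_t i = 0; i < count; ++i)
        out[i] = static_cast<float>(i);
    assert(out.size() == count);
}
```

Calling `shares_storage_by_value(pixels, pixels.data())` on a non-empty vector returns false, while the by-reference variant returns true. Whether a returned container gets copied depends on the compiler and the code around it, so this is something to check in the profiler rather than assume.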
Absolutely avoid indirection. I couldn't believe it, but removing even one level of indirection can give you 5-10% more speed. In my case, I replaced a call through a virtual function with an explicitly stored function pointer. For the virtual function call, the chain is: look up the object, offset into the vtable, call the target entry. The more direct call is: jump to the memory address stored here. I was actually pretty surprised, as I assumed that both pointers would sit in the L1 cache and hence the access cost should be minimal.
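The two call styles look roughly like this. This is a sketch with made-up names (`Filter`, `Brighten`, `FilterFn`), not the actual code from my project, and it only shows the shape of the change. Both loops compute the same result, and whether the second one is faster on your compiler and CPU is something only a measurement can tell you.

```cpp
#include <cassert>
#include <cstddef>

// Virtual interface: each call goes object -> vtable -> target function.
struct Filter {
    virtual ~Filter() {}
    virtual int apply(int value) const = 0;
};

struct Brighten : Filter {
    int apply(int value) const override { return value + 10; }
};

// Same operation as a free function, reachable through a plain pointer.
int brighten(int value) { return value + 10; }

typedef int (*FilterFn)(int);

// Calls through the virtual function for every element.
long run_virtual(const Filter& filter, const int* data, std::size_t n)
{
    long sum = 0;
    for (std::size_t i = 0; i < n; ++i)
        sum += filter.apply(data[i]);
    return sum;
}

// Calls through an explicitly stored function pointer for every element.
long run_pointer(FilterFn fn, const int* data, std::size_t n)
{
    assert(fn != nullptr);
    long sum = 0;
    for (std::size_t i = 0; i < n; ++i)
        sum += fn(data[i]);
    return sum;
}
```

The price of the second variant is that the function no longer has access to per-object state unless you pass it in explicitly.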
On a different subject: for some time now, I've been planning to give my blog a total overhaul. If you know a good page with blank templates for WordPress, I'd be happy to hear about it, as I'm looking for a very clean template to start with.