I’ve been working today on a project, and after the first implementation session I ran it through a profiler to see whether I had some obvious performance bottlenecks. Turned out not to be the case, but looking through the code, I’ve seen some opportunity to reduce the working set size a bit and partition the data so the CPU would work on a smaller part of it. Took quite some while, but I got down to less than 0.000x (the x is there cause the profiler does display only 0.000) misses per instruction, both L1, L2 and TLB, giving a 0.001-0.002% performance penalty for the L1 data misses and 0.000-0.001% for the L2 misses. Some more tuning improved the branch prediction hit rate up to 99.39% (originally, it was slightly below 99% due to the partitioning overhead), making my program overall 50% faster. Note that I didn’t change the underlying algorithms, I just changed how the data is presented to the algorithmic kernel! So even on modern CPUs with large caches and rather small working sets (just a few times bigger than the cache), cache aware code is still a win.