Some of you fellow readers have seen me, when I was in the “zone”, writing code like there is no tomorrow. A few weeks ago, I had a particularly good opportunity to see what it means to be determined when writing code, something we should strive for every time. Actually, I believe it’s the main trait that separates people like Carmack from us mere mortals. They are able to focus on the work at hand and overcome every problem with seemingly no effort, while we stumble over the simplest hurdle.
A side note on this post: if you’re wondering about the subtitles, there’s a simple pattern, and if you get it, feel free to comment ;)
Let’s get back to chronological order. It’s a sunny day, and I come to work. I’m doing research now, so progress matters more than anything else, and I had to get a basic algorithm running so we could see whether we were heading in the right direction. I know now that the direction was wrong, so everything I describe here has since been scrapped and replaced with a hopefully superior approach, but it’s still a very good example of what I want to explain in this post.
It’s only pain
The problem was an optimisation problem, for which we wanted to use an iterative solver. To keep it simple, we would compute each step by brute force (instead of some fancy gradient descent or other optimisation scheme), so I started the day by prototyping the algorithm in C++. Per iteration step (IS, for the remainder of this post), it would have to perform a small number of computation steps (CS) over a few million elements. A CS clocked in at 3 seconds, which would allow 50 iterations per night – this should be enough to see whether the algorithm was optimising anything at all.
However, I had this feeling that I could do better:
- the CS was crying out to be done with SSE, as it was already working on float arrays (with the length being a multiple of four – it doesn’t get more obvious than that)
- the IS could execute all CS in parallel – and my machine had four cores
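To make the first point concrete, here is a minimal sketch of the kind of rewrite meant here, assuming the CS boils down to element-wise arithmetic over float arrays whose length is a multiple of four. The function name and the multiply-add operation are illustrative placeholders, not the actual CS from the post:

```cpp
#include <immintrin.h>  // SSE intrinsics
#include <cstddef>

// Process four floats per instruction instead of one.
// n must be a multiple of four, as for the arrays described above.
void multiply_add_sse(const float* a, const float* b, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);    // load 4 floats from a
        __m128 vb = _mm_loadu_ps(b + i);    // load 4 floats from b
        __m128 vo = _mm_loadu_ps(out + i);  // current accumulator
        // out[i..i+3] += a[i..i+3] * b[i..i+3]
        _mm_storeu_ps(out + i, _mm_add_ps(vo, _mm_mul_ps(va, vb)));
    }
}
```

As the rest of this post shows, such a naive translation is not automatically faster than the compiler’s scalar code – alignment, scheduling, and the surrounding loop structure all matter.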
So my mental plan was to do the SSE optimisation first, and as soon as it gave reasonable results, jump into making it multithreaded; by lunch, I would be able to start it. So much for the plan.
Land of confusion
I gave the SSE optimisation a quick shot with intrinsics, giving me 15 seconds per CS instead of the 3 seconds of the C++ version. You can imagine how puzzled I was, especially as all I had done was replace the single-float operations with proper vector operations. However, I was not ready to give up that quickly. Maybe I have read too much of Michael Abrash’s “Black Book”. I took a more in-depth look at the intrinsics I used, dug out an SSE3 operation I had never used before, unrolled a loop, and came down to 6 seconds. Still much slower than the compiler’s code, and if somebody had stepped in, I’d have had a hard time explaining why exactly I had been doing assembly for over an hour ;)
My next try was to investigate the generated code. A good call, as the compiler generated somewhat weird code for the intrinsics, and at that moment, I was convinced I could do better on my own by rewriting the routine in assembly. Bear in mind, right up to that moment I had never used inline assembly in Visual C++, so I had no idea how to find out in which registers the parameters arrive, or how to actually access memory operands. Let alone that my general assembly skills were not quite stellar.
Look no further
However, it was all too late; I started reading the docs, browsing through a bit of example code, and hacking. I think I was crazy enough to finish it in less than half an hour, including debugging, and bam, I was down to 1.3 seconds. Some random rescheduling of the instructions brought it down to 1.2, and that was it: immediate satisfaction, I had won once again. Overnight, I would be able to run three times as many iterations as before.
The next hours were filled with evaluation, that is, checking whether the CS was working properly. There were some slight differences between the vector-operation code and the C++ code, but they turned out to be floating-point precision issues. Everything looked good so far, and one hour before I was planning to finish working, I thought it might be a good time to give the parallelisation a go. In the spirit of the day, I decided to use the Intel Threading Building Blocks, as, just like inline assembly, I had never used them before (this is not entirely correct: I had used the concurrent vector from the TBB in an OpenMP-based project, but I had not used the TBB to execute code in parallel yet).
Makes me wanna pray
For the IS, I would have to run several CS concurrently and gather the results, so I had to come up with a way to store the intermediate results without any shared state between the CS. This proved to be quite simple, and I wrote the IS with a parallel loop, ready to see it running in all its parallel glory.
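The pattern is simple enough to sketch. The post used a TBB parallel loop; for a self-contained example I use std::thread instead, but the key idea is the same: each CS writes into its own pre-allocated slot, so the IS needs no locking at all. Here, compute_cs is a hypothetical stand-in for the real per-step work:

```cpp
#include <thread>
#include <vector>

// Placeholder for the real computation step (CS) over millions of elements.
float compute_cs(int step) {
    return static_cast<float>(step) * 2.0f;
}

// One iteration step (IS): run all CS in parallel, each writing to its
// own pre-allocated slot, so there is no shared mutable state to guard.
std::vector<float> run_is(int num_cs) {
    std::vector<float> results(num_cs);  // one slot per CS
    std::vector<std::thread> workers;
    workers.reserve(num_cs);
    for (int i = 0; i < num_cs; ++i)
        workers.emplace_back([&results, i] { results[i] = compute_cs(i); });
    for (auto& t : workers)
        t.join();
    return results;  // gathering happens after all workers have joined
}
```

With TBB, the worker loop would become a single tbb::parallel_for over the CS indices, with the same per-slot output scheme.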
What I saw instead was an error that the CRT DLL was not found, which could not be solved by simply copying the CRT DLL into the folder – instead, an “application initialisation error” would come up. The handy Dependency Walker told me that my application was doing something wrong with the CRT. It really looked like some manifest or versioning problem, or rather: like several hours of fun figuring out which libraries were compiled against which DLL version.
I did some quick hacking, and the application did indeed start, only to crash at the first TBB call, so I suspected that some dynamic reloading was causing the problem, and that I would probably have to rebuild all my dependencies.
When I’m dead and gone
At that point, my supervisor came in, and I described the problems to him in the darkest colours. He tried to convince me to throw away the TBB stuff, thread on my own with the bare WinAPI, and skip optimisations this early in the project. The reasoning is not wrong: after all, if I spend 7 hours optimising something down from 8 hours to 1, I could just as well have let it run for 8 – the net result would be the same.
That was the final nail in the coffin: I would let it go, come back after the weekend, try to get it running for no more than one hour, and evaluate the non-parallel IS.
I fired up my RSS reader to browse through the news, fed up with the issue. The next week, I would come in to work again and start checking the algorithm, each IS taking four times longer than necessary – and this blog post would be a rather lame one at this point. If you know me, then you know that I’d rather spend the 7 hours to get it running 8 times faster, and then rewrite the whole thing the next morning to be both fast and maintainable. Probably I would rewrite it once more afterwards, just to make sure it had been rewritten three times, because I tend to rewrite everything at least three times.
It’s hard to explain properly, but sometimes you can literally feel this determination building up. I guess that our programming heroes are in this state all the time, which allows them to go where none of us can. At this moment, I had the feeling that no matter how big the obstacle, I would be able to overcome it, even if it meant (re-)writing the IS using something different. Two things were clear to me at that moment. Number one: I would rather die than leave this single-threaded. Number two: if the TBB wouldn’t run within 30 minutes, I would take yet another library – this time for thread pools – and learn how to use it. There was no chance that I would write something with the WinAPI, as that wouldn’t improve any skill. I had my share of WinAPI threading, and there is little to be learned from manually threading an application.
As I only write about days with happy endings, I eventually got the stuff working and was processing one IS in 3 minutes. The next week, I could evaluate it with the confidence that I had done as well as I could, and for that day, there was nothing left that I could improve.
Mercy on me
Looking back, I realise that I could have finished earlier if I had been determined to do it when the first errors with the TBB showed up. After all, it was a minor problem, and by the time my supervisor came in, it could have been running. I hope that the next time I find myself in a similar situation, I will remember this experience and try harder. That is one point; the other interesting question is: would I have been able to do more that day if I had skipped the optimisation stuff altogether and started evaluating in the afternoon?
Surprisingly, I’m sure the answer is no. No, because as explained above, I would have gone mad waiting 12 minutes for one IS to finish, knowing that it was my own fault. For me, this was yet another fork on the way to becoming a good programmer: on the left side, the broad road going down the hill, and on the right, the rocky path going upwards. This time, I chose the right one, and there is nothing to regret. After all, I was going to get results much faster, and more interesting results at that.
The swing of things
There is one key point to take home from this post: if you come across a tough problem, try to become really, deeply determined to solve it, at a level where you can be sure that you’re trying as hard as you can. Having had this recent experience, I’m pretty sure that this is one of the keys to getting into the “zone” of high productivity, and eventually heroic programming. If you don’t manage it on your own, ask a co-worker to push you, or chat with a friend and explain the problem. You’ll be amazed how lame your excuses sound.
It can be difficult if you aren’t confident in your skill set and don’t have a long track record of solved problems, but remember: we’re all in the same situation. Few of us are programming heroes. The only thing we can do is try harder every day, and I hope that this post gives you a little more confidence the next time you are about to do some heroic programming, or just to get things done! For me it’s simple: I’ll think of all you fellow readers who are on the right road towards awesomeness, and I’ll try a bit harder to keep up with you!