Archive

Archive for June, 2009

Short status update

June 29th, 2009

Since a few months, I’m an intern at INRIA – this is a French research institute. I’m in the REVES group, which is located in Sophia-Antipolis, next to the famous city of Antibes. In case this does not ring a bell, the next city to Antibes is Cannes to the south, and Nice to the north.

I’ll write a bit more about the location, especially Antibes, as I personally never heard about this town before.

Another short announcement: I’m going to SIGGRAPH this year as well, so I might be difficult to reach from 3-8 August.

Useful HTML reports

June 25th, 2009

While doing research, you sometimes run into the situation where you are getting loads and loads of results, and you need to create nice reports to simplify the analysis. Often this is tabular data, sometimes you might be able to create some images, but the key point is you need a simple way to bundle this all together. As the topic title suggest, HTML fits the bill very nicely. It’s reasonably easy to generate (after all, it’s text only), you can embed additional data (images), and you can also use JavaScript to improve the usability of the report.

For HTML generation, I started to use basic C++ iostreams, but this quickly becomes unmaintainable, and you will run into debug problems. If your markup is wrong, it’s very complicated to see in the C++ code where the error source is. A much nicer solution is Google ctemplate, a C++ HTML template engine in the spirit of Smarty. Basically, you can author the template using any HTML editor, add some markup for repeating sections, then you fill out some values on the C++ side, and your done. For debugging, you simply strip the template tags, and validate the code with any HTML validator. Two things are especially nice about ctemplate. First of all, repeating sections is trivial. If you add a dictionary for a section twice, it will be repeated twice in the output, no need to set up loop constructs on your own. Seconds, it’s fast — generating a 50 MiB HTML file takes ~ 2 seconds with it. For viewing such large files, I can recommend to use a WebKit based browsers, as at least Firefox 3.0 has some problems when closing such a file.

To improve the report usability, you can use make them interactive using a JavaScript. I can recommend the very nice JQuery framework — in fact, it worked so fine, that I didn’t bother to try another one ;) . JQuery allows to add hide/add hyperlinks very easily, which is useful if your report contains detail data that you don’t want to display all the time. All you need is to give your items an id, and then toggle the visibility of the item using $("#yourId").toggle(). If you have tabular data, JQuery has a real killer plugin: TableSorter. This allows you to sort tables interactively, without any additional setup code. All you have to do is to use proper headers for your table, and you’re ready to sort.

I have to admit that I still use plain-text only reporting quite a bit, but it’s usually much faster for analysis to have a nice report with some comfort (hyperlinks to additional data, embedded images). With the ctemplate stuff, the cost of adding HTML output over plain-text is rather minimal, so the next time you run into a situation where you’ll need some output, you should really consider to give HTML a try.

Google diff, match and patch ported to C#

June 22nd, 2009

The title says it all, I’ve ported the Google diff-match-patch library to C#. It’s a nice diff library, and supports Java, JavaScript, Python, C++ (with Qt) and C# with exactly the same API.

Porting wise, everything went fine due to the high test coverage (IIRC, something around 98% of the code is covered by unit tests), so I could quickly check if everything works as expected. For unit testing on C#, I initially used MSTest, but switched to NUnit for portability. The code works fine with Mono 2.4 and .NET 3.5, possibly also with older version. If you change the LINQ extension calls like First () to indexed based access, you’ll probably get it working on .NET 2.0 as well. So, have fun with this nice library!

Thanks to Neil Frasier for adding this into the project codebase and for further maintenance.

Some notes on strings, data structures and data bases

June 18th, 2009

A short war story on (not) using strings. In my application, I had objects which use a texture, and the texture was given as a string. I needed to index all objects by used texture, so I started by replacing the texture with an integer identifier in the objects, and provide a map from texture to id. This improves the performance, as grouping by texture is no longer a string compare, but just an integer compare (longer explanation: As I create indices in order, the identifier is actually an index directly into the texture array – you can’t get faster than that.)

The problem with this approach became obvious when I had to serialize the objects in a later step. I needed another map from integer id back to the string, so I could simply write the string out to disk. Which in turn required maintaining two mappings, and passing on one of those mappings down to the serialization function, which is worse than it sounds, because now the serialization is tightly coupled to the map.

Eventually, I dropped all this stuff, and stored the strings directly on both sides, and changes everything to use strings (all maps, all sets, every intermediate structure would use strings again.) Guess what? The performance difference was not measurable.

What can we learn from this experience? Strings are not inherently evil, and unless there is a clearly better solution, there is no reason to avoid them. Having strings has also the nice advantage that it is much easier to debug strings than looking at integers, especially if you dump a data structure to disk. If speed is really a concern for you, I’d suggest you to look at the excellent string-based containers in LLVM. They have string based sets and maps, which are likely to perform good enough for most use cases. Of course, if you use C++ TR1, you can also try the hash-based containers. Of course this is no excuse to use strings in tight inner loops where an integer identifier would work as well, but this seems an unlikely use-case anyway (plus your profiler will probably show this usage rather quickly).

Databases to the rescue?

In retrospective, it might be even worth to dump the indexed data structures in my application completely in favour of a memory-based embedded database, like sqlite. This solves the persistence problem nicely – every database comes with a way to store the data to disk. Plus, it allows higher-level manipulation of the data itself. I guess this is something I’ll have to try: Basically, instead of loading a data structure from disk and transform it into a runtime representation using maps, sets and vectors, I’ll make a copy from a disk-based database into a memory database, and work directly on this. That is, retrieval gives you a proxy structure which represents a row, setting parameters results in queries, and retrieving results is always cached. This is not so interesting for prototyping, due to the coding overhead, but could be a viable solution for the final version.

[Edit]: A short warning to not use strings in inner loops and other performance-critical cases.