Some notes on strings, data structures and databases

June 18, 2009

A short war story on (not) using strings. In my application, I had objects that use a texture, and the texture was given as a string. I needed to index all objects by the texture they use, so I started by replacing the texture string in each object with an integer identifier and providing a map from texture name to id. This improves performance, as grouping by texture is no longer a string compare but just an integer compare. (Longer explanation: as I create the identifiers in order, each one is actually an index directly into the texture array - you can’t get faster than that.)
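To make the setup concrete, here is a minimal sketch of that kind of interning scheme. The names `TextureTable`, `Object` and `group_by_texture` are made up for illustration and are not taken from the original application; the sketch only assumes that ids are handed out in order of first use.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical interning table: maps each texture name to a small integer id.
// Ids are assigned in order of first use, so they form a dense range 0..size()-1.
class TextureTable {
public:
    // Returns the id for `name`, assigning the next free id on first sight.
    std::size_t intern(const std::string& name) {
        auto it = ids_.find(name);
        if (it != ids_.end()) return it->second;
        const std::size_t id = ids_.size();
        ids_.emplace(name, id);
        return id;
    }

    std::size_t size() const { return ids_.size(); }

private:
    std::unordered_map<std::string, std::size_t> ids_;
};

// Hypothetical object: it stores only the id, not the texture name.
struct Object {
    std::size_t texture_id;
};

// Groups object indices by texture id. Because ids are dense, the id is used
// directly as an index into the result; no string compares are involved.
std::vector<std::vector<std::size_t>> group_by_texture(
    const std::vector<Object>& objects, std::size_t texture_count) {
    std::vector<std::vector<std::size_t>> groups(texture_count);
    for (std::size_t i = 0; i < objects.size(); ++i) {
        assert(objects[i].texture_id < texture_count);
        groups[objects[i].texture_id].push_back(i);
    }
    return groups;
}
```

Note that nothing in this sketch can turn an id back into a name, which is exactly where the trouble starts.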

The problem with this approach became obvious when I had to serialize the objects in a later step. I needed another map from integer id back to the string, so I could simply write the string out to disk. This in turn required maintaining two mappings and passing one of them down to the serialization function, which is worse than it sounds, because the serialization is now tightly coupled to the map.
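The coupling is easy to see in a function signature. The following is only a sketch under assumptions: `Object`, `serialize`, `to_text` and the `texture=` line format are hypothetical, and the reverse mapping is modelled as a plain vector indexed by id.

```cpp
#include <cassert>
#include <cstddef>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical object that only knows its texture by integer id.
struct Object {
    std::size_t texture_id;
};

// The writer cannot produce a texture name on its own: every caller has to
// hand in the reverse mapping (id -> name) alongside the object.
void serialize(std::ostream& out, const Object& object,
               const std::vector<std::string>& texture_names) {
    assert(object.texture_id < texture_names.size());
    out << "texture=" << texture_names[object.texture_id] << '\n';
}

// Convenience wrapper for inspecting the output in memory; it needs the
// mapping as well, so the dependency spreads to every layer above.
std::string to_text(const Object& object,
                    const std::vector<std::string>& texture_names) {
    std::ostringstream out;
    serialize(out, object, texture_names);
    return out.str();
}
```

Every function between the owner of the mapping and the code that finally writes an object has to pass that extra parameter along.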

Eventually, I dropped all of this, stored the strings directly on both sides, and changed everything to use strings again (all maps, all sets, every intermediate structure). Guess what? The performance difference was not measurable.

What can we learn from this experience? Strings are not inherently evil, and unless there is a clearly better solution, there is no reason to avoid them. Strings also have the nice advantage that they are much easier to debug than integers, especially if you dump a data structure to disk. If speed is really a concern for you, I’d suggest looking at the excellent string-based containers in LLVM. It has string-based sets and maps, which are likely to perform well enough for most use cases. If you use C++ TR1, you can also try the hash-based containers. Of course, this is no excuse to use strings in tight inner loops where an integer identifier would work just as well, but this seems an unlikely use case anyway (plus your profiler will probably show this usage rather quickly).
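For comparison, here is a rough sketch of the string-keyed variant using a hash-based container. It uses `std::unordered_map`, the standardized successor of the TR1 `std::tr1::unordered_map`; the names `Object`, `TextureIndex` and `index_by_texture` are again hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical object that carries its texture name directly.
struct Object {
    std::string texture;
};

// Texture name -> indices of all objects using that texture.
using TextureIndex = std::unordered_map<std::string, std::vector<std::size_t>>;

// Builds the index with one hash lookup per object. No id table is needed,
// and the keys stay readable when the structure is dumped for debugging.
TextureIndex index_by_texture(const std::vector<Object>& objects) {
    TextureIndex index;
    for (std::size_t i = 0; i < objects.size(); ++i) {
        index[objects[i].texture].push_back(i);
    }
    return index;
}
```

Since each object holds its own texture name, writing it out needs no extra mapping. Whether the hashing cost matters is something only a profiler run on the real workload can tell.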

Databases to the rescue?

In retrospect, it might even be worth dropping the indexed data structures in my application completely in favour of an in-memory embedded database like SQLite. This solves the persistence problem nicely - every database comes with a way to store the data to disk. Plus, it allows higher-level manipulation of the data itself. I guess this is something I’ll have to try: basically, instead of loading a data structure from disk and transforming it into a runtime representation using maps, sets and vectors, I’ll copy the disk-based database into an in-memory database and work directly on that. That is, retrieval gives you a proxy structure which represents a row, setting parameters results in queries, and retrieved results are always cached. This is not so interesting for prototyping, due to the coding overhead, but could be a viable solution for the final version.

[Edit]: Added a short warning not to use strings in inner loops and other performance-critical cases.