
GraphQL in the GPU database

One thing I've been asked about is providing some kind of API access to the GPU database I'm running. I've been putting this off for most of the year, but over the last couple of days, I gave it yet another try. Previously, my goal was to provide a "classic" REST API, which would provide various endpoints like /card, /asic etc. where you could query a single object and get back some JSON describing it.

This is certainly no monumental task, but it never felt like the right thing to do. Mostly because I don't really know what people actually want to query, but also because it means I need to somehow version the API, provide tons of new routes, and then translate rather complex objects into JSON. Surely there must be some better way in 2017 to query structured data, no?

GraphQL

Turns out, there is, and it's called GraphQL. GraphQL is a query language in which the user specifies the shape of the data needed, and the system then builds tailor-made JSON. On top of that, introspection is well defined, so you can discover which fields an endpoint exposes. Finally, it provides a single endpoint for everything, making it really easy to extend.

I've implemented a basic GraphQL endpoint which you can use to query the database. It does not expose all information, but it provides access to what is hopefully the most frequently used data. The main reason I'm not exposing everything is the lack of pagination: with the allCards query, you can practically join large parts of the database together, and I don't want to invite undue load on the server. As a small appetizer, here's a sample query executed locally through GraphiQL.

[Image: Using GraphiQL to query the GPU database programmatically. (/images/2017/gpudb-graphql.png)]
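If you'd rather script this than click around in GraphiQL, a GraphQL query is just a JSON payload sent via POST. Here's a minimal sketch using only the standard library; note that the endpoint URL is a placeholder and the query fields are assumptions based on the schema excerpts later in this post -- use introspection to discover the real ones.

```python
import json
import urllib.request

# A small GraphQL query. Field names here are assumptions based on the
# schema excerpt shown below; GraphiQL's introspection shows the real ones.
query = """
{
  allCards {
    name
    releaseDate
  }
}
"""

def build_request(url, query):
    """Wrap a GraphQL query in a JSON POST request, as GraphQL endpoints expect."""
    payload = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST")

# Placeholder URL -- not the database's actual endpoint.
request = build_request("https://example.com/graphql", query)
# response = json.load(urllib.request.urlopen(request))  # uncomment to send
```

The nice part is that this one request shape works for every query you'll ever send; only the query string changes.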

If you want to see more data exported, please drop me a line, either by sending me an email or by getting in touch through Twitter.

Background

What did I have to implement? Not that much, but at the same time, more than expected. The GPU database is built using Django, and fortunately there's a plugin for Django to expose GraphQL, called graphene-django, which in turn uses Graphene as the actual backend.

Unfortunately, Graphene, and graphene-django in particular, is not as well documented as I was hoping for. There's quite a bit of magic happening where you just specify a model and it tries to map all fields, but those fields won't be documented. I ended up exposing things manually: I restricted the fields using only_fields, and then wrote at least a definition for each field, occasionally with a custom resolve function. For instance, here's a small excerpt from the Card class:

class CardType (DjangoObjectType):
    class Meta:
        model = Card
        name = "Card"
        description = 'A single card'
        interfaces = (graphene.Node, )
        only_fields = ['name', 'releaseDate'] # More fields omitted

    aluCount = graphene.Int (description = "Number of active ALU on this card.")
    computeUnitCount = graphene.Int (description = "Number of active compute units on this card.")

    powerConnectors = graphene.List (PowerConnectorType,
        description = "Power connectors")

    def resolve_powerConnectors(self, info, **kwargs):
        return [PowerConnectorType (c.count, c.connector.name, c.connector.power)
                for c in self.cardpowerconnector_set.all()]

    # more fields and accessors omitted

Here's another interesting bit. The connection between a card and its power or display connectors is a ManyToManyField, complete with custom data on it. Here's the underlying code for the link:

class CardDisplayConnector(models.Model):
    """Map from card to display connector.
    """
    connector = models.ForeignKey(DisplayConnector, on_delete=models.CASCADE)
    card = models.ForeignKey(Card, on_delete=models.CASCADE, db_index=True)
    count = models.IntegerField (default=1)

    def __str__(self):
        return '{}x {}'.format (self.count, self.connector)

In the card class, there's a field like this:

displayConnectors = models.ManyToManyField(DisplayConnector, through='CardDisplayConnector')

Now the problem is how to pre-fetch the whole thing: otherwise, iterating through the cards will issue one query to fetch the display connectors, and then one more query per connector to get the data related to it. This led to a rather lengthy quest to figure out how to optimize things.

Optimizing many-to-many prefetching with Django

The end goal is a single Card.objects.all() query which somehow pre-fetches the display connectors (the same applies to the power connectors, but I'll stick to the display connectors for the explanation). We can't use select_related though, as it's only designed for foreign keys. The documentation hints at prefetch_related, but it's trickier than it seems: if we just use prefetch_related ('displayConnectors'), this will not prefetch what we want. What we actually want to prefetch is the relationship itself, and from there on select_related the connector. Turns out we can use a Prefetch object to achieve this: we prefetch the set storing the relationship (which is called carddisplayconnector_set), and provide an explicit query set which can then specify the select_related data. Sounds complicated? Here's the actual query:

return Card.objects.select_related ().prefetch_related(
    Prefetch ('carddisplayconnector_set',
        queryset=CardDisplayConnector.objects.select_related ('connector__revision'))).all ()

What this does is force one extra query on the display connector table (with joins, as we asked for a foreign key relation there), and then cache that data in Python. Now when we ask for the display connectors, we can look them up directly without an extra round trip to the database. How much does this help? It reduces the allCards query time from anywhere between 4-6 seconds, with 500-800 queries, down to 100 ms and 3 queries!
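This is the classic N+1 query pattern. Stripped of Django entirely, the difference between the naive and the prefetched access pattern can be illustrated with a toy sketch -- an in-memory "database" that counts round trips (the tables and numbers here are made up for illustration):

```python
# Toy illustration of the N+1 problem, independent of Django: each fetch_*
# call stands in for one round trip to the database.
QUERY_COUNT = 0

CONNECTOR_LINKS = {  # card id -> list of connector ids (made-up data)
    1: [10, 11], 2: [10], 3: [11, 12]}
CONNECTORS = {10: "DVI", 11: "HDMI", 12: "DisplayPort"}

def fetch_links_for_card(card_id):
    global QUERY_COUNT
    QUERY_COUNT += 1
    return CONNECTOR_LINKS[card_id]

def fetch_connector(connector_id):
    global QUERY_COUNT
    QUERY_COUNT += 1
    return CONNECTORS[connector_id]

def fetch_all_links_with_connectors():
    # One query joining the link table with the connector table -- the
    # moral equivalent of select_related inside the Prefetch queryset.
    global QUERY_COUNT
    QUERY_COUNT += 1
    return {card: [CONNECTORS[c] for c in links]
            for card, links in CONNECTOR_LINKS.items()}

# Naive: one query per card, plus one per connector.
QUERY_COUNT = 0
naive = {card: [fetch_connector(c) for c in fetch_links_for_card(card)]
         for card in CONNECTOR_LINKS}
naive_queries = QUERY_COUNT

# Prefetched: a single batched query, then pure in-memory lookups.
QUERY_COUNT = 0
prefetched = fetch_all_links_with_connectors()
prefetched_queries = QUERY_COUNT
```

The results are identical; the naive version just pays one round trip per card plus one per connector, while the prefetched version pays once, which is exactly the effect Prefetch had on the real database above.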

Wrapping it up

With GraphQL in place, and some Django query optimizations, I think I can tick off the "programmatic access to the GPU database" item from my todo list. Just in time for 2017 :) Thanks for reading, and don't hesitate to get in touch with me if you have any questions.

Version numbers

Version numbers are the unsung heroes of software development, and it still baffles me how often they get ignored, neglected or implemented improperly. Selfish as I am, I wish everyone would get them right, and today I'm going to try to convince you why they are really important!

The smell of a release process

Having version numbers indicates some kind of release process, assuming you don't assign a version to every single commit in your repository. It means you've reached a point where you think it's useful for your clients to update; otherwise, there's no need yet to assign a new number. That's reason number one to have them -- communication with your downstream clients. It might sound stupid, but just by assigning version numbers to your commits, a client can learn a lot about your project:

  • Size of each release -- seeing how many commits go into each version gives an idea of how much churn there is.
  • Release frequency -- do you assign a new number once a week? Once a month? This gives a good idea of how quickly you're going to react to pull requests, issues, and more. It's also critical information for any system-level application, as an administrator may have to install the update, and knowing the frequency and size of every release is needed to allocate the right resources.
  • Bug fix check -- you fixed a bug; how does the client know it got fixed? By shipping a version number the user can query.
  • Change logs -- assigning a version number is a good moment to sit back and think about what was added, writing up some documentation along the way.

You can encode even more information if you use semantic versioning, which in theory guarantees to clients when it's safe to update, and more. While I like it in theory, I think semantic versioning is mostly useful for libraries, less so for large applications and frameworks, as you'll typically end up incrementing the major version a lot. The only really large project I'm aware of that follows semantic versioning is Qt -- and they do quite an impressive job with regard to API and ABI compatibility. It's nice to have if you can enforce it, and worth striving towards, but it's not the main value add.
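As a quick illustration of the guarantee semantic versioning encodes, here's a minimal "is it safe to upgrade" sketch. It deliberately ignores pre-release tags and build metadata, and the 0.x handling follows the common convention of treating minor as major there (strict semver actually makes no promises at all for 0.x):

```python
def parse_semver(version):
    """Split 'MAJOR.MINOR.PATCH' into a tuple of ints.
    Pre-release tags and build metadata are deliberately ignored."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch

def safe_to_upgrade(current, candidate):
    """Per semver, an upgrade is backwards compatible as long as the major
    version doesn't change; for 0.x we treat minor as major (a common
    convention, not part of the spec)."""
    cur, cand = parse_semver(current), parse_semver(candidate)
    if cur[0] == 0 or cand[0] == 0:
        return cur[:2] == cand[:2] and cand >= cur
    return cur[0] == cand[0] and cand >= cur
```

That single function is the whole promise: a client can decide, mechanically, whether to pick up a new release -- which is exactly why it works so well for libraries.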

But it's ... complicated!

I assume that most developers not using version numbers are aware of the reasons above and didn't just "forget" them, but have a hard time versioning for various reasons. Typically, these fall into two categories:

  • Continuous integration -- rapid releases, no formal release process.
  • Very branchy development process -- versions are branch-specific.

To the first point, continuous integration: no matter how you write software, your releases happen over time. You typically don't expect your clients to update to every single release you're doing, so how about using the ISO date (year-month-day.release) as your version number? Turns out that will usually work just fine, and it still allows people to refer to things with a common naming scheme instead of referencing your code drops with a hash or some continuous integration commit. In fact, I'd argue you're set up for success already, because the very same system you use for continuous integration can also assign version numbers.
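Such a date-based scheme is trivial to automate. A sketch of generating the year-month-day.release number described above (the function name and the per-day counter are my own illustration, not a standard):

```python
import datetime

def date_version(release_index, today=None):
    """Build a version like '2017-12-29.1' from the ISO date plus a
    per-day release counter -- the '.release' part mentioned above."""
    today = today or datetime.date.today()
    return "{}.{}".format(today.isoformat(), release_index)

# Pinning the date makes the example deterministic.
version = date_version(2, today=datetime.date(2017, 12, 29))
```

A nice side effect of the ISO date prefix is that versions from different days sort chronologically even as plain strings.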

The other problem is super branchy development, where you have multiple lines of code in development concurrently. Let's say you have one branch for stable releases, one for future releases, and one maintenance branch, with no good correlation between them. The trick here is to look at the problem from the client's end: for the client, there's only a single branch they see. It's your duty to ensure the useful properties outlined above hold for all your clients, which may mean that every branch gets versioned separately, or that you treat your branches as separate products. This is something I've noticed many people forget in software development -- we're not writing code for us, we're writing code for our users, and if our process makes their life harder, we've failed, because (at least, that's the theory) there will be many more users than us, so their time is more precious.

Version all the things

I hope I could shed some light on the value of version numbers and make you hesitate next time you're about to send an email saying "everyone should use commit #4237ac9b0f or later" :) Do yourself a favor: use that tag button in your revision control system and make everyone's life simpler. Thanks!