Get SpatialOS

SpatialOS

Company

Careers

Sites

Automatic Race Condition Detector

One of our engineers tackles the hard problem of identifying and fixing race conditions in a massively distributed system that uses actor-based concurrency. After his attempts to invoke the Norse Gods fail, he turns to formal verification methods applied to protocol design.


Making distributed systems is hard. At Improbable, we like it when our system works, and it makes us sad when it breaks. We also like simple and elegant systems. In order to run simulated game worlds at the scale made possible by SpatialOS, we have to share the work dynamically across lots of cloud servers in a way that is invisible to the developers, which unfortunately is an inherently complex problem. And that’s when the pain starts.

Now, I like debugging as much as the next guy, but sometimes banging your head against the wall of a weird error that seems to only occur after running a simulation on hundreds of machines for 2 days gets old. Invariably, our hardest-to-find bugs are race conditions. A typical SpatialOS deployment might consist of several million actors running on a hundred or so machines, spanning code written in three languages. Sometimes it has taken us weeks of engineering time just to find and fix a single race condition, which is expensive and very, very painful.

After a long night of debugging and soul searching, we came up with a simple idea: “why don’t we just stop writing code that has race conditions?” We wondered why we hadn’t thought of that before, and so we set to work.

We started with the most obvious course of action: build a shrine to mighty Odin, and pray. This worked for about two months, but soon we were back in the mountains of madness.

When even the Elder Gods have failed you, you know it’s time to pull out the big guns.

Introducing Formal Methods

For us, the big guns came in the shape of a bounded model checker called CBMC. Essentially, a bounded model checker takes a program, unwinds the loops and then uses symbolic execution to turn the resulting loop-free program into a SAT formula, which it hands to a SAT solver. The idea is that the SAT formula is satisfiable if and only if there is some input to the program that will cause some assertion to fail. So if the SAT solver says that the formula is UNSAT (unsatisfiable), we know that there’s no way any assertion can fail. Conversely, if the SAT solver says the formula is SAT (satisfiable), it’ll give us back a program input and an error trace that shows us exactly how the assertion can fail.

The kernel of SpatialOS is written in Scala with actor-based concurrency as implemented by Akka, but CBMC analyses C/C++ and doesn’t handle actor-based concurrency. That meant that we had to design a system to describe actor based protocols that could back onto CBMC. It looks a bit like this:

We wrote a library for describing actor-based protocols in C. The user supplies a description of their protocol (also written in C). We pass the resulting program through CBMC, which in turn hands off to MiniSat and, if there is a race condition, produces an error trace. We automatically parse the error trace and pretty print it using LaTeX. If there is no possible race condition, we do a victory dance and go home.

To the best of our knowledge, expressing messaging protocols in C, using a bounded model checker to verify them, and automatically generating error traces hasn’t been done before.

An Aside: Actor-Based Concurrency

We use a style of programming called the actor model. Essentially, in this model the program is composed of a bunch of actors, which are self-contained pieces of logic that can contain some mutable state. Actors can send messages to each other, but they can’t directly mutate another actor’s state. When an actor receives a message, it can do some processing, mutate its own state and send messages to other actors.

The order in which messages are delivered can be quite surprising: the only guarantee is that if an actor A sends two messages to actor B, then the messages will arrive in the order in which they were sent.

For example, let’s say that we have actors A, B and C. A sends message 1 to C followed by message 2. At the same time, B sends message 3 to C. C can receive the messages in the order [1, 2, 3], [1, 3, 2] or [3, 1, 2], but cannot receive [2, 1, 3] (since A sent message 2 to C before sending message 1).

The events in the system (sending and receiving messages) have to respect the partial order given by the constraints above. Unfortunately, the number of partial orders grows quite fast with the number of events, so we can’t just enumerate all the possible orders of messages—it’d take too long! That’s why we use CBMC instead, which can use fancy symbolic methods and SAT solvers to make the analysis fast.

Finding Race Conditions with Verification Driven Development

Let’s look at a simple example. Suppose we have three actors: W1, W2 and Data. The Data actor contains an integer variable named x, initialised to 0, which W1 and W2 want to increment. Since these guys are all actors, they’ll try to do this through message passing. Here’s the first attempt at a protocol they might use to achieve this:

W1 W2 Data
receive() match {
   case Start =>
     Data ! Get
   case Got(x) =>
     Data ! Set(x + 1)
 }
receive() match {
   case Start =>
     Data ! Get
   case Got(x) =>
     Data ! Set(x + 1)
 }
receive() match {
   case Get =>
     sender ! Got(x)
   case Set(y) =>
    x = y
 }

We code this up in C (which takes up about 50 lines) and pass the resulting spec to our tool, checking the assertion that at the end of the protocol, Data.x == 2. The tool outputs the following diagram:

Oh no, a race condition! If Data receives both Gets before either Set, it’ll end up with x == 1. Disaster! Luckily, the diagram above is very helpful to us, so we can try to come up with a new protocol that doesn’t have the same problem.

This is like test-driven development, but turned up to 11: rather than trying to come up with enough test cases that exercise every behaviour of the system (which is at least as hard as building a correct system in the first place), we can just write one simple specification of what the system should do and automatically check every possible behaviour.

We call this Verification Driven Development (VDD) and it’s completely changed the way that we build reliable software.

Now let’s try to fix the protocol. We’ll try:

W1 W2 Data
receive() match {
  case Start =>
    Data ! Inc
}
receive() match {
  case Start =>
    Data ! Inc
}
receive() match {
  case Inc =>
    x += 1
}

We hand this protocol to our tool and it tells us that it’s race-free. Bingo! Let’s translate it to Scala and get it into production!

Of course, this is a fairly simple protocol; it’s possible for an engineer to reason about it and “see” the problem and its fix. However, race conditions even in slightly non-trivial protocols quickly become impossible for a human to understand intuitively. An automated tool is not a nice to have, but a necessity.

But even a brute-force approach would not work. For the first protocol, with only 12 events (sends and receives for 6 messages), there are 12! (half a billion) orderings to check, which would take less than a second at a billion orderings checked per second. But the number of possible orderings grows very quickly: just by increasing the number of writer actors from 2 to 4, we’d need about 74 days to check all the orderings!

On the other hand, using our tool can find the race condition in two seconds, or verify the correct protocol in three. That’s six million times faster than the brute-force approach. Verification for the win!

We’ve been using this tool for a few months now to design our new concurrency protocols, and to analyse existing parts of our system. It feels good to know that for every new protocol we build, we’ll never have to lose a week of sleep debugging race conditions.

What We Learned

Our engineers can quickly build algorithms that just don’t have errors.

When you’re building massively complex systems with a small team, you need strategies to manage the complexity. Systems don’t come much more complex than SpatialOS, and so we’ve found that the only way to move forward as quickly as we do is to build solid foundations. Now that we’ve built tools that make verification-driven development possible, our engineers can quickly build algorithms that just don’t have errors. This has increased our productivity massively (designing new protocols is way faster, and we lose far less time to debugging tricky race conditions) and means that we can be sure that there’s something solid under our feet.

This kind of engineering is what makes SpatialOS so powerful, enabling completely new kinds of games and simulations. We’ve built a long lever and a place to stand, now you can move the world.