Why did we rebuild the SpatialOS Runtime?

16 July 2019

The SpatialOS Runtime mediates an online game’s various servers, clients and engine instances. Earlier this year, we updated it substantially.

Back in January, we released ‘Runtime v2’. This was a complete re-architecting of the core SpatialOS Runtime components. We addressed a lot of our concerns and uncertainties about the performance, stability, and linear scaling of the first design.

Runtime v2 has been a fantastic success for us as a team. Now that it has been ‘out in the wild’ for a few months, we’re going to reflect on the last couple of years and take an honest, critical look at how both our technology and our team worked, how they have improved, and how we can continue to make them better.

We’re going to reflect on three main areas in this retrospective:

  1. The decision to rebuild the Runtime.
  2. The predictability of our technology.
  3. Our maturing as a platform.

But first, before we dive too deeply into our decisions, we should briefly go over what the Runtime is supposed to do.

What is the Runtime?

The Runtime manages the canonical state of a large multiplayer game world.

As a piece of technology, we describe it as a system for “scalable view synchronisation”. What this means in the context of games is best explained by an example:

A player’s game client needs to know about some of the world’s state around it, like where the nearby NPCs are and what they’re doing. This is the player client’s view of the world, and as the player moves around the world, the data they want about the world changes.

The Runtime is responsible for making sure each client’s view is synchronised with the Runtime’s canonical state of the world. Because our primary application is for multiplayer games, we need to make sure that this synchronisation happens with minimal latency.
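To make the idea of a view concrete, here is a minimal sketch in Python. It is not the Runtime's implementation: it assumes a 2D world, a view defined as "every entity within a fixed radius of the player", and the names `Entity`, `ClientView` and `sync` are all hypothetical. It shows only the core idea that a view is recomputed against the canonical state and that the client is told which entities entered and which left.

```python
from dataclasses import dataclass, field
from math import hypot


@dataclass
class Entity:
    entity_id: int
    x: float
    y: float


@dataclass
class ClientView:
    """Hypothetical client view: every entity within `radius` of the player."""
    radius: float
    visible: set = field(default_factory=set)

    def sync(self, player_x, player_y, world):
        """Recompute the view from canonical state and return (entered, left)."""
        now_visible = {
            e.entity_id
            for e in world.values()
            if hypot(e.x - player_x, e.y - player_y) <= self.radius
        }
        entered = now_visible - self.visible  # entities the client must start receiving
        left = self.visible - now_visible     # entities the client can stop receiving
        self.visible = now_visible
        return entered, left


# Canonical state: three entities spread along the x axis.
world = {
    1: Entity(1, 0.0, 0.0),
    2: Entity(2, 5.0, 0.0),
    3: Entity(3, 50.0, 0.0),
}
view = ClientView(radius=10.0)
first = view.sync(0.0, 0.0, world)    # player starts at the origin
second = view.sync(45.0, 0.0, world)  # player has moved far to the east
```

In this toy example the first sync brings entities 1 and 2 into view, and the second swaps them for entity 3. A real system would also stream component updates for entities that stay in view, and would avoid scanning the whole world on every sync.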

The main design element that makes SpatialOS unique compared to other pieces of technology for game networking is the ability to scale a single game world. Putting more computational power behind a Runtime instance (by provisioning more machines to run it on) allows it to support linearly more players, a larger world size, and higher throughput of changes to the world.

The decision to re-architect everything

Back in 2017, our core technology was in a good state to demonstrate that previously impossible games were achievable and that there was market interest in such technology being widely available. 

However, as a team, we had increasing concerns about how we were going to keep improving the technology. New features were getting harder to build and much slower to deliver, performance optimisations felt increasingly like convenient hacks, and debugging the system when things went wrong was a nightmare.

It was not an easy decision to spend the next year and a half rebuilding Improbable’s core technology. That’s an enormous amount of engineering time. Choosing to build something new rather than to incrementally improve what we had was a risky prospect.

The most important factor in favour of a complete overhaul was that we had used the original Runtime to get a much clearer picture of what we needed to build as a product. We used it to experiment with a lot of ideas - for example, the first Runtime architecture originated as an actor-based message routing system, rather than a framework that dealt with views of data! We made assumptions about which features would be important to support. Some of these experiments left us with technical debt, and some of our assumptions were inevitably wrong.

The thought was that, if we could start somewhat afresh, we would be able to build a system better-suited to our more firmly established product vision. There were also features, like Query Based Interest, that we knew would be valuable to users, but we would fundamentally not be able to support at scale on the old architecture.

The rebuilding plan

We broke rebuilding the Runtime into three standalone blocks: each of them reworked a core component of the system to be more aligned with what SpatialOS should provide to game developers. Together, the blocks would give us the system we hoped for, but each was valuable in its own right. This reduced the riskiness of the project to a level that we were much more comfortable with as a business: even if one of the component reworks went wrong, we would still have improved our product.

These three blocks make up the main areas of our system today. The entity database provides a canonical representation of data that can be split out across several shards and is designed specifically for the concept of clients that subscribe to views of a world. The load balancing system provides a way of managing write access to data from our users’ game code. And, finally, the bridge system provides a unified interface to player clients and simulating servers to interact with game deployments.

Looking back 

In retrospect, we can see that it took us a long time to decide on the correct course of action. We had already spent large amounts of engineering time on cleaning up and incrementally growing the technology before the decision to rebuild. 

We don’t see this as a bad thing by any means: as engineers, it’s far too easy to convince ourselves that the right solution is to throw away what we have and build something new. It’s also important to remind ourselves that just because the project ended up being a success does not mean that we made the right decision. Instead, we should be reassured by having done very intensive due diligence on the risks of this project and having plans in place in case things went wrong.

As the Runtime is no longer as monolithic or interdependent, our hope is that we do not need to make a decision like this in the future. 

We’re already starting to see some evidence that this is the case. Since the release of Runtime v2, we have substantially changed the bridge component to improve its throughput and reduce its resource consumption. We were able to do so in parallel with the rest of development, and swap in the new version without any change in semantics or need for SDK upgrades for our users.

Reaching for maturity

One feature we advertised extensively in the early days of SpatialOS was “automatic, dynamic worker load balancing”. It’s a great example of a much broader theme in the Runtime’s past design shortcomings.

The idea was that the Runtime would inspect how much load was on each simulating worker in the game world, and dynamically redistribute entities to even out the simulation load. It could determine how many workers a game needed to run and then adjust their number accordingly - either by shutting workers down if there was an excess, to save on cost, or by starting up more to handle any spikes in load.
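The general shape of such a scheme can be sketched in a few lines of Python. This is an illustration only, not the algorithm the Runtime used: it assumes each entity has a single numeric load, that every worker has the same capacity, and it swaps in a simple greedy "heaviest entity to least-loaded worker" heuristic. The function names and the sample loads are hypothetical.

```python
from math import ceil


def workers_needed(entity_loads, capacity_per_worker):
    """Number of workers the current total load calls for (at least one)."""
    return max(1, ceil(sum(entity_loads.values()) / capacity_per_worker))


def rebalance(entity_loads, worker_count):
    """Greedily place entities, heaviest first, on the least-loaded worker."""
    assignment = {w: [] for w in range(worker_count)}
    totals = {w: 0.0 for w in range(worker_count)}
    # Sort by descending load, breaking ties by entity id for determinism.
    ordered = sorted(entity_loads.items(), key=lambda kv: (-kv[1], kv[0]))
    for entity_id, load in ordered:
        # Pick the worker with the lowest total so far (lowest index on ties).
        target = min(totals, key=lambda w: (totals[w], w))
        assignment[target].append(entity_id)
        totals[target] += load
    return assignment, totals


# Hypothetical per-entity simulation loads, in arbitrary units.
loads = {"npc-1": 4.0, "npc-2": 3.0, "npc-3": 3.0, "player-1": 2.0}
count = workers_needed(loads, capacity_per_worker=6.0)
assignment, totals = rebalance(loads, count)
```

Even in this toy version, a small change in one entity's load can alter the worker count and move unrelated entities to different workers, so the person writing the game code cannot easily tell in advance which worker will simulate which entity.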

The feature sounds great on paper and made for some really cool demos. The algorithm for redistributing load was sound and the mechanism we’d built for it worked. However, in practice, we saw many users abandon it not long after they started development. 

They abandoned it because the system was totally unpredictable from their perspective and hence made the design of their game systems much more complex. They preferred to know which workers would simulate which entities over any cost savings they could get from the dynamic load balancing.

What we learned

This load balancing experience was instructive for our future plans. We now try to think about designing systems that are intuitive and more firmly in the control of the user. Instead of providing magical black boxes, we want to provide sharp tools.

This has become easier for us as a team as other parts of the SpatialOS offering (notably the GDKs, which nicely wrap Unreal and Unity) become more mature. They’re able to provide systems that work out of the box, while we provide the lower-level abstractions for power users.

In the spirit of a critical retrospective: we’ve come a long way, but we haven’t gone far enough yet. Our system is far more predictable and understandable than it once was to us as Improbable engineers. However, we are still a long way from making the behaviour of the Runtime, especially how it responds to load and how it scales to bigger games, understandable to non-specialist developers using SpatialOS. We have more solid foundations to work with now, but we still need to put in the effort to make this a part of the product that shines rather than falls short.

Ensuring smooth migrations

When we started planning Runtime v2, we had a few studios working with SpatialOS and had already made the platform available publicly to anyone. Over the course of the next eighteen months, we were also actively attracting new users.

Introducing big breaking changes or version incompatibilities would harm our relationship with our users and weaken their trust in the maturity of our platform. Getting the migration from Runtime v1 to Runtime v2 right was going to be an important milestone in demonstrating that we would be a stable development environment to adopt.

It’s fair to say that about half of the work of Runtime v2 wasn’t building the new system, but planning a migration path so no users would be adversely impacted. We needed to verify and firm up the more subtle semantics and timing interactions of our APIs, provide configurations that gave interoperability with older entity visibility systems, and validate our developers’ end-to-end usage of the product.

We released Runtime v2 in a way that allowed users to incrementally enable the new core components. Each of these incremental stages was provided as an opt-in for any users that had concerns about a specific component affecting their game. After the last stage had been around for about a month, we made Runtime v2 the default and considered it ‘launched’, with the old version available for users running into issues with the upgrade.
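The opt-in, then default-with-opt-out sequence can be pictured as a small piece of per-deployment configuration. The sketch below is hypothetical: it assumes the stages map one-to-one onto the three components described earlier, and the names `Deployment`, `opted_in`, `opted_out_of_v2` and `runtime_versions` are invented for illustration rather than taken from any SpatialOS API.

```python
from dataclasses import dataclass

# Assumed component names, based on the three blocks described earlier.
COMPONENTS = ("entity_database", "load_balancing", "bridge")


@dataclass(frozen=True)
class Deployment:
    name: str
    opted_in: frozenset = frozenset()  # components enabled early, before launch
    opted_out_of_v2: bool = False      # temporary opt-out once v2 is the default


def runtime_versions(deployment, v2_is_default):
    """Return which Runtime version serves each component for a deployment."""
    if v2_is_default:
        # After launch: everything runs v2 unless the studio opted out entirely.
        version = "v1" if deployment.opted_out_of_v2 else "v2"
        return {c: version for c in COMPONENTS}
    # Before launch: only the components the studio opted in to run v2.
    return {c: ("v2" if c in deployment.opted_in else "v1") for c in COMPONENTS}


early = Deployment("early-adopter", opted_in=frozenset({"bridge"}))
cautious = Deployment("milestone-demo", opted_out_of_v2=True)

before_launch = runtime_versions(early, v2_is_default=False)
after_launch = runtime_versions(early, v2_is_default=True)
held_back = runtime_versions(cautious, v2_is_default=True)
```

The point of structuring a rollout this way is that each studio decides when to take on risk, and the old path can be deleted once no deployment still selects it.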

Two of our users did temporarily “opt out” of the new Runtime for a couple of weeks. Both studios had large internal milestones to meet with technical demos and did not want to introduce additional noise during a high-risk period. Both were able to migrate off the old Runtime easily once their milestones had passed, after which we promptly removed the old Runtime from existence.

In many ways, the launch of Runtime v2 was pretty quiet from the perspective of the outside world. We’d like to think that this indicates we did a good job at proving we can develop our platform in a way that is stable and sustainable. 

Where is the team going now?

Despite the success of this release, we want to provide even more tools that give our users confidence in the performance and stability of the Runtime. Up until now, we’ve operated on an evergreen release model for the Runtime, meaning that it is rolled out as part of our platform’s infrastructure on a regular basis.

However, a couple of recent occurrences, such as the need for a platform-wide freeze on deploying to production during events like GDC or game launches, have highlighted that our process for releasing the Runtime needs to improve. If a user had faced an issue in production during these periods that required a Runtime change to mitigate, we would have been unable to service their request. We can’t rely on ‘getting lucky’ and should preempt issues of this nature.

In the near term (the next six months or so), there are several projects that we’re working on that leverage the new Runtime architecture. We’re looking much more deeply into the performance of the Runtime, specifically in terms of cost efficiency. Several SpatialOS games are heading towards launch, and we want to cut the Runtime’s resource usage, both to reduce operating costs and to increase the number of concurrent players and the data throughput for a fixed amount of hardware.

We’re also looking into improving the observability and predictability of the Runtime to users, as we called out earlier. There are a few outstanding features, specifically around providing more ‘sharp’ load balancing tools to users, that we’re also looking at completing. 

Finally, the less monolithic nature of the Runtime codebase also means that we’re looking to grow our team’s headcount, as we’re in a much better position for parallelising work between more engineers.

The far-beyond

For predictions beyond that: there are external factors that could sway our priorities. The needs of game developers actually building games on SpatialOS are the most important signals for what we should be working on, and we acknowledge that their needs change and evolve. 

One of the directions that we’re particularly passionate about as a team is supporting truly enormous game worlds. Games currently being built on SpatialOS are merely “testing the waters” of the achievable scale. We’re both hoping and expecting that, sometime soon, a game design will come along that pushes the Runtime beyond the scale it is currently capable of. We’ll need to be ready to take on the ambition of such a game with the same degree of robustness, predictability and usefulness that we are striving for today.

There’s a whole swathe of things to design and build, like faster migration protocols between machines or better dynamic scaling of the Runtime’s computational resources, that we know we’d eventually need to support that kind of game. It’s our cherished hope that we will get a reason to implement them soon.