Peter Mounce is a build and release engineer in the engineering velocity team at Improbable.
In this post we’ll be talking about a team’s journey to effect a company-wide set of changes to make continuous integration (CI) not just better but great. It captures our team’s effort over 18 months, addressing years’ worth of accumulated problems, and ending up with an internal product that we are thrilled with.
At Improbable, the engineering velocity team are a group of people devoted to making the practice of delighting users with our software less viscous. Since May 2018, this has involved a bunch of effort to improve how we use CI. What is CI? It’s what keeps people honest – it watches source control for changes, then builds/tests/packages/releases them. Why is CI needed? People forget to do these things before committing or merging code. You begin to benefit from CI when you have more than about three people working on software together.
What took 18 months? Well, as a fast-moving technology startup, we were faced with a number of problems and expectations:
We used two products – we won’t say which ones – and both had deficiencies that meant we didn’t want to continue using them.
Build agents were shared between teams building software, but managed by a team that wasn’t building software. This meant that changing a dependency usually caused an outage for one team or another, because the managing team wasn’t aware of which dependencies at which versions customer teams had, and nor was any single customer team.
Build agents were enlisted into config-management, but that would run at any point, meaning that a build might be interrupted by a config-management change.
Build agents were managed as pets, not cattle. This meant lots of manual fixing of build environments, and inevitable configuration drift, even with config-management (because the people debugging often didn’t know how to reflect changes back to config-management). This also led to ‘Ye Olde Reliable’ syndrome, where particular builds would only have a hope of passing on particular agents but no one knew why.
We didn’t have enough build agents to meet our demand as we grew and we didn’t want to buy more licences for products that weren’t working well for us. We also didn’t want to have a finance conversation each time we wanted to add more capacity.
We overspent on build agents because we ran them 24/7, regardless of whether they were busy.
We didn’t source-control our CI configuration well.
We needed to support around 12 teams with varying requirements.
We wanted to open-source parts of our codebase (check out our GDK for Unreal and GDK for Unity), which meant we wanted to enable a reasonable contributor workflow – that involves being able to read a build log to see why CI failed.
Debugging broken builds was quite inefficient. If you found an error that your code hadn’t introduced, trying to find out when it was introduced involved looking in build logs one by one. Consequently, this didn’t happen much.
Optimising builds was inefficient. Answering the question “Which parts of my build are slow?” was harder than it needed to be. You’d have to run a build and then do timestamp arithmetic, build by build. Consequently, this didn’t happen much either, so builds got slower.
Our CI masters accreted loads of dangerous and unowned secrets to all sorts of interesting and powerful things. No one knew which secret was used by which build; no one knew whether it was safe to rotate or delete a secret; and no one knew whether secret A was actually a duplicate of secret B.
Turns out we had a lot of technical and organisational debt and risk in how we did CI. This wasn’t surprising – it had grown organically over the lifetime of the company, only had dilute ownership at best, and there was no concerted investment to improve it. We had a lot to fix.
We started by listing the problems, turning them into MoSCoW-prioritised requirements, adding use-case or problem statements to each requirement (to short-circuit the “I don’t think this is a must” discussions), then doing some market research.
We decided on some tenets for what we wanted to offer our engineering population:
We will prioritise engineering time over compute spend, but we won’t waste compute spend if we can avoid it. For example:
We don’t want an engineer to even open CI unless CI breaks because of an error in the engineer’s changes. When that happens, we want it to be as obvious as possible what to do next.
We don’t want an engineer to spend time waiting because their build is in a queue.
We never want an engineer to spend time debugging a build environment and the result to be “It’s different from expected.”
We don’t want an engineer to do tedious toil to find where a problem was introduced.
We don’t want build-automators to do tedious toil finding out what to make faster first. Build-automators are scarce specialists and we want to empower them and make it easy to figure out where to start.
Teams own their own build environments. For example:
You want to ship some software? It’s your product, so we’ll train you how to set up build machines with sane repeatable source-controlled methods.
Your build agent is misbehaving? Shut it off – another will be along shortly.
Your build agents keep misbehaving? We’ll help you look and solve your problem, or you can keep shutting them off – your choice.
Your build agents are too expensive? Let’s look at monitoring and logs together to see if there’s anything obvious that you can optimise.
Team A can’t break team B by surprise.
We don’t want an engineer to turn off useful test coverage “because it costs too much cloud spend”. (This narrowed the field of possible products the most, because only a couple of CI products are licensed per user.)
It will always be possible to make a historical build, using CI configuration and build environment as at the time of the original point in history.
When teams migrate workloads from our existing CI, we’ll migrate, then improve.
If a person asks for help and we can’t answer with a link to user-facing documentation that we’ve written (or the product’s own documentation), we have two things to fix: the person’s problem and our documentation.
We want it to be not just possible but easy to run CI for games.
We don’t want an engineer to need to use CI for any reason other than “wrote some code that was wrong for some reason”.
We chose Buildkite because it met all our requirements. It licenses per user and lets us run as many build agents as we like, on any platform we want, on-premise or on someone else’s computers in clouds. We have nothing but good things to say about Buildkite after 18 months of advanced usage.
Around Buildkite, we put the following in place:
To support historical builds and build-environment-drift reduction: a Packer+Ansible baking process for making golden images, so we can have Google Cloud spin up instances from a known good, versioned, starting state; plus a way of source-controlling CI configuration inside the repository containing the codebase; and a way of pushing that config via command line to Buildkite to keep it in sync.
To address capacity issues: an autoscaler that uses Buildkite’s Jobs API to understand when there are jobs waiting, then scale up instances to process those. We don’t want engineers to wait for queued jobs to start CI.
To address cloud-spend issues: the autoscaler scales down instances that are idle, and allows teams to self-service, adjusting most run-time properties themselves via their source-controlled CI configuration. We also exported all node metrics to our Prometheus monitoring, so people can watch which resources their jobs cause nodes to spend (CPU, memory, network, storage and so on) during activity, and we tagged all cloud resources so we could attribute spend to individual teams. Finally, we offered the cost dashboard next to the utilisation dashboard, so teams could optimise their cloud spend directly (caveat: planned work, not yet completed).
To address build-environment ownership issues: documentation and training materials to support customer teams using the Packer+Ansible workflow for creating their build-agent images, using a baking pipeline in CI.
To address macOS support: we’re using Anka Flow and on-premise Mac minis (still baked using Packer+Ansible).
To address flake monitoring, de-flaking and other such investigation/debugging: we ship all our build-log lines into Elasticsearch/Logstash/Kibana, enriched with all the build-context information, so we can query effectively.
To avoid cargo-culting CI and knowledge rot: we wrote a lot of user-facing documentation as we went, making each page a guide for how to complete a particular task. This documentation style has proved very effective and has been adopted by other teams.
To eliminate cold-start delays from freshly scaled-up agents: we’re trying an approach that runs representative builds during baking in a controlled way, so that checkouts are present and caches are warmed up. This saves the time and network costs required to sync 200 GB from a Perforce master to a fresh agent. This is opt-in, because it’s at odds with our hermetic environments guarantee. We have ideas to make it more robust.
Tools that dramatically speed up the code and content builds of Unreal Engine 4 and Unity games: this is to the point where it’s reasonable to wait for a build to run before submitting a change. This was game-changing for our internal studios’ velocities. (We use FastBuild’s cache and distributed compilation combined with Unreal Engine 4’s Derived Data Cache for asset cooking; builds that used to take 4 hours now routinely take under 10 minutes.)
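To make the autoscaling idea above a little more concrete, here is a minimal Python sketch of the decision at its core: scale up when jobs are waiting, scale down when instances sit idle. It is an illustration under assumptions, not our actual autoscaler. The `QueueState` type, the function names and the default limits are all hypothetical, and fetching the job counts (from Buildkite’s Jobs API, in our case) is deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueueState:
    """Snapshot of one agent queue. All fields are hypothetical inputs."""
    waiting_jobs: int   # jobs scheduled but not yet picked up by an agent
    running_jobs: int   # jobs currently executing
    instances: int      # instances currently provisioned for this queue


def desired_instances(state, min_instances=0, max_instances=50,
                      jobs_per_instance=1):
    """Return how many instances the queue should have right now.

    Demand is every job that is running or waiting, so nothing queues
    for lack of capacity and idle instances are released.
    """
    demand = state.waiting_jobs + state.running_jobs
    needed = -(-demand // jobs_per_instance)  # ceiling division
    # Clamp to the limits a team has chosen (assumed defaults here).
    return max(min_instances, min(max_instances, needed))


def scaling_delta(state, **limits):
    """Positive: instances to add. Negative: idle instances to remove."""
    return desired_instances(state, **limits) - state.instances
```

For example, `scaling_delta(QueueState(waiting_jobs=5, running_jobs=2, instances=3))` asks for four more instances, while a queue with no jobs and four instances gets a delta of minus four. A real autoscaler would also need cool-down periods and care not to terminate an instance mid-job; the sketch ignores both.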
The response to our CI solution was overwhelmingly positive:
“I like it – I barely have to look at it, but when I do, it’s intuitive.”
“It’s so great not having to click the ‘Run’ button when I push changes.”
“Our pipeline was 90 minutes end to end, now it’s 15 minutes, because we can parallelise.”
“Having source-controlled agent environments is a game-changer. No more being broken by other teams’ dependencies. All changes are in source control and reviewed, which also means it’s easy to roll back.”
“My experience with Buildkite is that it’s much easier to transfer an existing working CI setup with good best practices to a new project or team, and scaling agent capacity has been much less of a pain.”
“The killer feature for me in Buildkite is the ability to version the CI configuration and its steps together with the code in our Git repository.”
“A locked-down, source-controlled and versioned agent environment has given us much more confidence about the state of our agents and builds. This has allowed us to enable a distributed build cache, sharded by agent version, and has drastically sped up our CI.”
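The last quote mentions a build cache sharded by agent version. One way to picture that, as a sketch rather than a description of that team’s implementation, is to prefix every cache key with the version of the agent image that produced the entry, so that results built in different environments can never be mixed up. The function name and key layout below are assumptions made for illustration.

```python
import hashlib
from typing import Sequence


def sharded_cache_key(agent_version: str, inputs: Sequence[bytes]) -> str:
    """Build a cache key namespaced by agent image version (hypothetical layout).

    The digest covers the build inputs; the prefix selects the shard, so a
    new agent image starts with an empty cache instead of reusing entries
    built in a different environment.
    """
    digest = hashlib.sha256()
    for blob in inputs:
        # Length-prefix each input so [b"ab", b"c"] and [b"a", b"bc"] differ.
        digest.update(len(blob).to_bytes(8, "big"))
        digest.update(blob)
    return f"{agent_version}/{digest.hexdigest()}"
```

With this layout, rolling back to an earlier agent image also rolls back to that image’s cache shard, which fits the source-controlled agent environments described above.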
We’re not done yet, but now we can quite confidently say, “We’ve got great CI.” And we have plans to extend it and package it as part of the SpatialOS product offering. In our experience, some games studios would benefit from a turn-key solution for nightly (or per-submit, or pre-submit) builds, and as a company that wants to reduce the risk of shipping a game, we’re keen to help.
Does this interest you? Engineering Velocity at Improbable would love to hear from you – we’re recruiting.
Does engineering for a company that values human time over computer time interest you? We’d love to hear from you, too!