This January Telstra, Australia’s largest telecoms provider, suffered a countrywide outage. The outage affected 16.7 million subscribers on its 3G and 4G networks; some customers lost the ability to make phone calls, while others reported a complete loss of data services for over three hours. The outage not only cost the company millions in compensation (just last week, the CEO announced a day of free data); it also put businesses that rely on mobile phone communications at risk of losing hours of trade.
These kinds of incidents aren’t rare; errors happen often, frequently at scale. Sometimes they happen by mistake. In 2008, for example, an ISP (Internet Service Provider) in Pakistan famously cut off access to YouTube for a large portion of the world by accidental misconfiguration of a central server.
Accidents happen, but of more concern are deliberate attempts to hijack traffic. At the height of the Arab Spring, the Egyptian government closed 88% of the Egyptian internet, blocking more than 3,500 routes, which ensured that no website could be accessed and that its citizens could not communicate online. Meanwhile, the largest Distributed Denial of Service (DDoS) attack occurred against the non-profit anti-spam organisation Spamhaus, affecting millions of ordinary Internet users.
What does this tell us? That these kinds of events are happening more often, and getting more severe. Businesses, governments, and ordinary people are increasingly reliant on services they get over the Internet. Imagine the implications of these kinds of errors and attacks on the Internet’s infrastructure when our cities, cars and homes are also connected. Truly understanding what the World Economic Forum once termed “the dark side of interconnectivity” is critical, but this is well beyond the reach of current analysis tools.
Simulating the Internet
A few weeks ago, a team of two came in from the British government to explore our technology. Their goal was to build a realistic simulation of the internet so that they could take a look at its “structure”, or in other words, the vast number of connections between the computers and networks that make up the internet. With the internet under attack from a variety of sources, it’s critical that they can see its weak spots in order to figure out how to protect it.
It was quite an ambitious project, given that they weren’t familiar with SpatialOS and given the huge scale and complexity of the system they wanted to simulate.
In their own words, “Having never developed an application on SpatialOS before, this was a tall order for a 3 day sprint. However, combining the flexibility of the platform with the experience and enthusiasm of Improbable’s engineers, resulted in a simulation that surpassed all initial expectations.”
The result was a 1:1 scale simulation of the backbone of the internet. We believe it is the largest simulation of its kind ever to have been created.
What does this simulation show?
“Not only did we demonstrate a dynamic model of BGP routing at scale, we also produced an interactive visualisation where both AS’s and the connections between them can be created or destroyed, observing dynamic routing, cascade failures and new route propagation across the network.”
In precise terms, this is a fully dynamic and interactive SpatialOS simulation of all the Autonomous Systems (AS) on the internet, using a communication system closely modelled on the Border Gateway Protocol (BGP). An AS usually belongs to an ISP such as British Telecom or Comcast, but may also be owned by sites like Netflix or YouTube.
BGP is a routing protocol that exchanges reachability information between networks. In other words, it’s one of the core protocols that make the internet work. BGP is driven by routing tables. A routing table is basically a big list of all the routes for each destination from a given starting point. When a new connection is introduced between two autonomous systems, they will exchange their routing tables and update their own copy to include the new connection information.
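The table exchange described above can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the simulation's code: the names `learn_from_neighbour` and `connect` are hypothetical, a route is just a tuple of AS numbers, and the only selection rule is "prefer the shorter AS path", whereas real BGP weighs many more attributes and policies.

```python
from typing import Dict, Tuple

# A routing table maps a destination AS number to the AS path used to reach it.
# The path lists the ASes after the owner, ending at the destination;
# an AS reaches itself with the empty path ().
RoutingTable = Dict[int, Tuple[int, ...]]


def learn_from_neighbour(own_asn: int, own: RoutingTable,
                         neighbour_asn: int, neighbour: RoutingTable) -> RoutingTable:
    """Return a copy of our table updated with the routes a neighbour advertises."""
    updated = dict(own)
    for dest, path in neighbour.items():
        if dest == own_asn or own_asn in path:
            continue  # ignore routes to ourselves and paths that already pass through us
        candidate = (neighbour_asn,) + path
        current = updated.get(dest)
        # Assumption: shortest AS path wins. Real BGP applies further rules and policy.
        if current is None or len(candidate) < len(current):
            updated[dest] = candidate
    return updated


def connect(asn_a: int, table_a: RoutingTable, asn_b: int, table_b: RoutingTable):
    """Simulate a new link: both sides exchange tables and update their own copy."""
    new_a = learn_from_neighbour(asn_a, table_a, asn_b, table_b)
    new_b = learn_from_neighbour(asn_b, table_b, asn_a, table_a)
    return new_a, new_b


if __name__ == "__main__":
    # AS 1 already knows AS 10; AS 2 already knows AS 20 (made-up AS numbers).
    table_1 = {1: (), 10: (10,)}
    table_2 = {2: (), 20: (20,)}
    table_1, table_2 = connect(1, table_1, 2, table_2)
    print(table_1[20])  # (2, 20): AS 1 now reaches AS 20 via AS 2
    print(table_2[10])  # (1, 10): AS 2 now reaches AS 10 via AS 1
```

After the new link comes up, each side can reach everything the other side could, one hop further away.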
For example, when you try to load youtube.com, a request is sent to your ISP. Your ISP may not have a direct link to YouTube’s data centre, but thanks to its routing table, it will know how to get one step closer. The request is then passed along to another ISP and might go through several more locations before ending up at YouTube.
But what if one of those servers along the way is malfunctioning? Or being routed by someone? Perhaps it really doesn’t know how to reach youtube.com? In that case, nobody using your ISP will be able to reach YouTube.
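To make the hop-by-hop idea concrete, here is a minimal Python sketch. Everything in it is an illustrative assumption: the `trace` function, the three made-up network names and the dictionary layout are not part of the simulation. Each network only knows the next hop towards a destination, and a single network that loses its entry breaks the path for everyone behind it.

```python
from typing import Dict, List, Optional

# tables[network][destination] -> the neighbouring network to hand the request to.
NextHopTables = Dict[str, Dict[str, str]]


def trace(tables: NextHopTables, source: str, destination: str,
          max_hops: int = 32) -> Optional[List[str]]:
    """Follow next-hop entries from source; return the path, or None if the request is lost."""
    path = [source]
    current = source
    while current != destination:
        next_hop = tables.get(current, {}).get(destination)
        if next_hop is None or len(path) > max_hops:
            return None  # no route known here, or the request is going round in circles
        path.append(next_hop)
        current = next_hop
    return path


if __name__ == "__main__":
    # Hypothetical networks: your ISP has no direct link and relies on a transit provider.
    tables = {
        "your_isp": {"youtube": "transit"},
        "transit": {"youtube": "youtube"},
    }
    print(trace(tables, "your_isp", "youtube"))  # ['your_isp', 'transit', 'youtube']

    # The transit provider loses its route: nobody behind your ISP can reach the site.
    tables["transit"] = {}
    print(trace(tables, "your_isp", "youtube"))  # None
```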
The simulation allows you to create and delete ISPs, configure links between them, or load in data from the real world. It will calculate the best routes between any two nodes using BGP, with each node storing its own routing table and acting independently from all the others, just as in the real world.
With SpatialOS, you could build simulations which could be used to investigate these kinds of failures and find the routes most likely to be affected. What happens if a rogue AS starts advertising routes to everywhere but then dumps all traffic? Would large portions of traffic be routed through it? What if it then passed the traffic on to the correct destination? Perhaps there are a few key routes we could protect that would mitigate the effects of such failures.
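As a toy version of the rogue-AS question, the Python sketch below asks: if a rogue AS falsely claims to be the origin of a destination, which networks end up sending their traffic to it? The topology, the names and the `chosen_origin` function are all hypothetical, and the model assumes every AS simply prefers the origin that is the fewest AS hops away (ties go to the legitimate origin). Real BGP decisions also depend on policy and prefix length, so treat this as an illustration only.

```python
from collections import deque
from typing import Dict, Iterable, List, Set, Tuple


def build_links(edges: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Turn a list of AS-to-AS links into a symmetric adjacency map."""
    links: Dict[str, Set[str]] = {}
    for a, b in edges:
        links.setdefault(a, set()).add(b)
        links.setdefault(b, set()).add(a)
    return links


def chosen_origin(links: Dict[str, Set[str]], origins: List[str]) -> Dict[str, str]:
    """For every reachable AS, return the origin it ends up routing towards.

    Multi-source breadth-first search: each AS picks the origin with the fewest
    AS hops; on a tie, the origin listed first in `origins` wins.
    """
    choice = {origin: origin for origin in origins}
    queue = deque(origins)
    while queue:
        asn = queue.popleft()
        for neighbour in sorted(links[asn]):
            if neighbour not in choice:
                choice[neighbour] = choice[asn]
                queue.append(neighbour)
    return choice


if __name__ == "__main__":
    # Made-up topology: "victim" is the real destination, "rogue" the hijacker.
    links = build_links([
        ("victim", "T1"), ("T1", "T2"), ("T2", "rogue"),
        ("T1", "A"), ("T2", "B"), ("T2", "C"), ("rogue", "D"),
    ])
    after = chosen_origin(links, ["victim", "rogue"])
    captured = sorted(asn for asn, origin in after.items()
                      if origin == "rogue" and asn != "rogue")
    print(captured)  # ['B', 'C', 'D', 'T2']
```

In this small example a single false announcement pulls four of the six other networks towards the rogue AS, which is the kind of effect a full-scale simulation lets you measure on realistic data.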
You could also use such a simulation to monitor routes in the real world and detect when they deviate from what the simulation predicts, to spot errors as they happen or to detect more malicious behaviour. Considering how many people rely on the internet, the implications are enormous.
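A monitoring check of this kind could start out as simple as the following Python sketch, which compares the routes the simulation predicts with the routes actually observed. The function name, the data layout and the example AS numbers are assumptions made for illustration; a real monitor would also need to tolerate legitimate route changes.

```python
from typing import Dict, Optional, Tuple

Path = Tuple[int, ...]


def find_deviations(predicted: Dict[str, Path],
                    observed: Dict[str, Path]) -> Dict[str, Tuple[Path, Optional[Path]]]:
    """Return every destination whose observed AS path differs from the predicted one.

    A destination that is missing from `observed` is reported with None as its path.
    """
    deviations = {}
    for destination, expected_path in predicted.items():
        actual_path = observed.get(destination)
        if actual_path != expected_path:
            deviations[destination] = (expected_path, actual_path)
    return deviations


if __name__ == "__main__":
    # Made-up destinations and AS numbers.
    predicted = {"video-site": (100, 200, 300), "news-site": (100, 400)}
    observed = {"video-site": (100, 666, 300), "news-site": (100, 400)}
    for destination, (expected, actual) in find_deviations(predicted, observed).items():
        print(destination, "expected", expected, "but saw", actual)
```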
Simulating thousands of AS networks on SpatialOS
The whole internet contains about 60,000 AS networks and over half a million routes. Since every AS has to store path information about routes to nearly every other AS, the routing tables can become unmanageably large, and their ever-growing size has already caused problems.
In our simulation, every AS runs simultaneously and independently, so storing the full routing table for each AS would require many terabytes of RAM by itself. Doing this sort of simulation on a single server would be almost impossible: we don’t have the full list of networks and routes, but what we do have is enough to require 15 machines and over 1TB of RAM.
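A rough back-of-the-envelope calculation shows why the memory adds up so quickly. The sketch below uses the figure of about 60,000 AS networks quoted above; the per-entry size of 1,000 bytes is purely an assumption chosen for illustration (the real cost depends on path lengths and on how entries are stored), so the result is an order-of-magnitude estimate rather than a measurement.

```python
AS_COUNT = 60_000          # approximate number of AS networks, as quoted above
BYTES_PER_ENTRY = 1_000    # ASSUMPTION: illustrative size of one stored route entry


def full_table_bytes(as_count: int, bytes_per_entry: int) -> int:
    """Memory needed if every AS stores one route to every other AS."""
    entries = as_count * (as_count - 1)
    return entries * bytes_per_entry


if __name__ == "__main__":
    total = full_table_bytes(AS_COUNT, BYTES_PER_ENTRY)
    print(f"{total / 1e12:.1f} TB")  # 3.6 TB under the assumptions above
```

The number of entries grows with the square of the number of networks, which is why the tables dominate the memory budget.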
Building a distributed system of this magnitude without SpatialOS would have been an enormous undertaking, and the result wouldn’t have been as flexible or easy to maintain. Building it during a three-day sprint would have been unthinkable.
In SpatialOS, the model is expressed in terms of the Entity-Component-Worker architecture. A simulation is populated by entities, which are defined by their components. The components include properties that describe their internal state at any point in time, and have associated behaviours that implement their logic. These behaviours run in workers, a swarm of compute resources managed by SpatialOS to perform the simulation work.
In this simulation, entities are used to represent each Autonomous System and route. The associated behaviours of their components model the BGP protocol and the network flows themselves. All of the Autonomous Systems and routes act independently with no shared data; all information is shared by message passing using the Messaging API. These behaviours run in logic workers, which are managed by SpatialOS; it transparently spins up as many worker instances as are needed to support the workload created by this model.
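The "no shared data, only messages" idea can be illustrated without any SpatialOS code at all. The plain-Python sketch below does not use the SpatialOS SDK or its Messaging API; the `AutonomousSystem` class, the `simulate` function and the single in-memory queue standing in for the workers are all hypothetical. Each entity holds private state and changes it only in response to route advertisements it receives, re-advertising a route only when it has improved.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Set, Tuple

Path = Tuple[int, ...]


@dataclass
class AutonomousSystem:
    """An entity with private state that only changes when it handles a message."""
    asn: int
    neighbours: Set[int] = field(default_factory=set)
    routes: Dict[int, Path] = field(default_factory=dict)  # destination -> AS path

    def handle(self, sender: int, dest: int, path: Path) -> List[Tuple[int, Path]]:
        """Process one advertisement; return (recipient, path) messages to send on."""
        if self.asn in path:
            return []  # the path already passes through us: drop it to avoid loops
        current = self.routes.get(dest)
        if current is not None and len(current) <= len(path):
            return []  # not better than what we already have
        self.routes[dest] = path
        return [(n, (self.asn,) + path) for n in sorted(self.neighbours) if n != sender]


def simulate(edges: Iterable[Tuple[int, int]]) -> Dict[int, AutonomousSystem]:
    """Build the entities, seed one advertisement per link, and run until quiet."""
    entities: Dict[int, AutonomousSystem] = {}
    for a, b in edges:
        entities.setdefault(a, AutonomousSystem(a)).neighbours.add(b)
        entities.setdefault(b, AutonomousSystem(b)).neighbours.add(a)

    # Each message is (recipient, sender, destination, path).
    inbox = deque()
    for entity in entities.values():
        for n in sorted(entity.neighbours):
            inbox.append((n, entity.asn, entity.asn, (entity.asn,)))

    # Stand-in for the workers: deliver messages one at a time until none remain.
    while inbox:
        recipient, sender, dest, path = inbox.popleft()
        for target, new_path in entities[recipient].handle(sender, dest, path):
            inbox.append((target, recipient, dest, new_path))
    return entities


if __name__ == "__main__":
    # Four made-up ASes in a line: 1 - 2 - 3 - 4
    entities = simulate([(1, 2), (2, 3), (3, 4)])
    print(entities[1].routes[4])  # (2, 3, 4)
```

Because an entity never reads another entity's state directly, the same logic could in principle be spread across many machines, which is the property the architecture described above relies on.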
The visualisation and interaction layer is built using the Unity game engine and the SpatialOS SDK. This is an unmanaged worker that runs on an end-user machine and serves the dual purpose of visualising the state of the simulation and letting the user influence it. Using a game engine lets us iterate quickly to build very attractive visualisations.
Building upon this model
The potential for the continued development of this model, and its application, is vast. Traffic shaping models could be applied within each AS to prioritise particular data flows; you could simulate ISP network filters and court-mandated blocking orders. Having such a model would allow service providers to experiment with a network which, due to its autonomous nature, displays emergent behaviours that cannot be predicted.
This would provide large amounts of data that could be run through traditional real-time and forensic network analytics, while large-scale simulation would also make it possible to study potential future events. Cyber security is one of the areas where a model such as this could have a significant impact on understanding network vulnerabilities, and how different types of attacks and exploits propagate across the internet. Applications could range from protection of critical national infrastructure, right down to a safer and more secure internet for individual users.
With a detailed simulation of the internet we could begin to prepare for cyber attacks before they happen, understanding the vulnerabilities and the cascading effects of various interventions better. This will enable businesses, institutions and even countries to become more resilient in an age of exponential vulnerability online.
But the possibilities don’t end there. This model is available for developers to integrate with their existing models today. In fact, in the past week we ourselves have begun to integrate this model with an ongoing project to model the infrastructure of entire cities. Using our platform, this was as simple as importing the internet simulation code into the city simulation project and setting up the physical locations of the networks correctly.
Being able to combine orthogonal simulations in this way will enable organisations to understand how the internet relates to other complex systems: cities, infrastructure, energy, economies. You could, for example, model a national power infrastructure network to comprehend not just its potential vulnerabilities but the impact it could have on a new transport network or the economy.
“We left Improbable at the end of the week wanting more, seeing numerous applications for this technology across our estate, and energised by the people and culture of a team on the edge of a new paradigm in cloud computing at scale.”