Introducing KCP: a new low-latency, secure network stack

Improbable is excited to announce that SpatialOS now supports KCP, included in the SpatialOS 13.6.0 release. Our new KCP-based network stack is built around the third-party KCP protocol library and provides an alternative to RakNet and TCP. It offers encryption, configurability, cheap stream multiplexing and improved latency on unreliable networks, at the cost of additional bandwidth overhead compared to TCP and RakNet.

If you expect your client-workers to connect over wireless or mobile networks, we strongly recommend configuring them to use KCP for time-sensitive updates such as character positions or shooting commands and events. This will help to ensure that your SpatialOS game feels responsive to end users. TCP remains a sensible default choice for server-workers running on the reliable networks within the data centres that host your SpatialOS deployment.

In this blog post, we will explore a SpatialOS worker’s network stack, focusing on how it differs for KCP compared to TCP and RakNet. We will highlight key features of KCP and see how they can be useful in certain circumstances. Alternatively, you can get started with KCP right now.

KCP network stack: a comparison with TCP and RakNet

To understand what value KCP adds, let’s look at what happens to each layer of the network stack when your worker sends data to the SpatialOS runtime using a connection to a bridge.

Once your worker has established a connection with a bridge, all data transferred between the worker and the bridge passes through the following layers:

The KCP network stack.

The first layer is the application layer, which implements the worker protocol. High-level features of SpatialOS such as creation, updates and interest for entities and components are implemented at this layer.

Reliable transport layer

Generally speaking, in real-time multiplayer games built on SpatialOS, you want updates to be delivered between your client-workers (on your players’ machines) and the bridge as quickly as possible. Delayed updates can reduce both the perceived quality of your game as well as feel unfair to players if they affect the game’s outcome.

Each update sent from your client-worker, as a worker protocol message, must take a long and treacherous journey to the bridge. The packet is typically sent via a router on the player’s local network and a series of IP-enabled hosts on the public internet before finally arriving at the bridge. Every hop along the path from the game client to the bridge brings with it the chance of the packet being delayed or dropped altogether.

All the client-worker can do is hand the packet to the network layer and wait in hope for an acknowledgement (“ACK”) of receipt of the packet from the bridge. But what if that acknowledgement doesn’t come?

That’s where the reliable transport layer of the network stack comes into play. It is in charge of deciding when to transmit (send) packets. When a packet is scheduled for transmission depends on factors such as the bridge’s remaining receive capacity (flow control windows) and how quickly the receiving side acknowledges packets (packets have to be retransmitted when they are detected as lost, often after a timeout). This scheduling is implemented differently by each of TCP, RakNet and KCP.
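This bookkeeping can be illustrated with a minimal model. This is a sketch of the general idea only, not SpatialOS or KCP code; the window size and timeout values are made-up illustrative numbers:

```python
class ReliableSender:
    """Toy model of a reliable transport layer: at most `window`
    unacknowledged packets may be in flight at once, and any packet
    whose ACK has not arrived within `rto` seconds is retransmitted."""

    def __init__(self, window=4, rto=0.2):
        self.window = window      # flow control: max packets in flight
        self.rto = rto            # retransmission timeout (seconds)
        self.in_flight = {}       # sequence number -> time last sent

    def can_send(self):
        return len(self.in_flight) < self.window

    def send(self, seq, now):
        self.in_flight[seq] = now         # (re)transmit, restart timer

    def on_ack(self, seq):
        self.in_flight.pop(seq, None)     # frees a window slot

    def due_for_retransmit(self, now):
        """Packets whose ACK is overdue and which must be resent."""
        return [s for s, t in self.in_flight.items() if now - t >= self.rto]
```

The interesting design space lies in `due_for_retransmit` and the window sizing: how aggressively these decisions are made is exactly where TCP, RakNet and KCP differ.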

Why not use TCP?

The problem with TCP is that it was designed a long time ago. TCP was designed to incur minimal bandwidth overhead and avoid congestion on large shared networks. This made sense at the time; bandwidth was scarce and real-time multiplayer games didn’t exist. Lost packets on traditional wired networks usually indicated congestion on intermediate IP hosts. Therefore, in response to packet loss, TCP significantly reduces the number of packets it allows to be transmitted at once and retransmits each lost packet at exponentially increasing intervals. This ensured that shared networks like the internet ran smoothly as a whole, in exchange for occasional latency spikes and reduced throughput.
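To see why exponential backoff hurts latency, consider how quickly the retransmission intervals add up. This sketch uses a 200ms initial timeout purely as an illustrative value, not TCP’s actual default:

```python
def retransmit_times(rto_ms=200, retries=5):
    """Cumulative retransmission times under exponential backoff:
    each retry doubles the previous timeout, so the i-th retry
    happens rto * 2**i after the (i-1)-th."""
    times, elapsed = [], 0
    for i in range(retries):
        elapsed += rto_ms * (2 ** i)
        times.append(elapsed)
    return times

# With a 200 ms initial timeout, the five retries fall at 200, 600,
# 1400, 3000 and 6200 ms after the original send: a single repeatedly
# lost packet can stall an ordered stream for multiple seconds.
```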

However, on modern, wireless networks using protocols such as Wi-Fi (802.11) and 4G, interference from other transmitting devices and signal degradation often lead to packets being corrupted in transit. Although some variants of wireless protocols attempt to retransmit data, they usually give up after a finite number of retries and simply discard the packet, which eventually manifests as packet loss at the transport layer.

If we combine these properties of modern networks with our desire to deliver data as quickly as possible during a real-time game, it’s clear that we need alternative protocols with different priorities to those of TCP.

(We are planning support for truly unreliable communication in the future - watch our blog for more details.)

Why use KCP?

Our alternatives are KCP and RakNet. Both assume that delivering each packet as quickly as possible is more important than the efficient use of bandwidth and conservative congestion avoidance. Amongst other techniques, KCP retransmits packets more quickly, anticipates packet loss before retransmission timeouts, and backs off more slowly when packet loss is detected. (You can read more about the differences between the KCP and TCP reliable transport protocols here.)
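One of these techniques, anticipating packet loss before a retransmission timeout fires, can be sketched as follows. This is a simplified model of fast retransmission, not KCP’s actual implementation, and the skip threshold is an illustrative value:

```python
def packets_to_fast_retransmit(sent, acked, threshold=2):
    """Fast-retransmission sketch: if ACKs arrive for packets sent
    AFTER an unacknowledged one, the gap is probably a loss. Once an
    outstanding packet has been 'skipped' by `threshold` later ACKs,
    resend it immediately instead of waiting for its timeout."""
    retransmit = []
    for seq in sent:
        if seq in acked:
            continue
        skips = sum(1 for a in acked if a > seq)
        if skips >= threshold:
            retransmit.append(seq)
    return retransmit

# Packets 1-5 were sent; ACKs arrived for 2, 3 and 4. Packet 1 has
# been skipped three times, so it is resent now; packet 5 has not
# been skipped at all, so it simply keeps waiting for its ACK.
```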

Another technique that the KCP network stack uses to mitigate delayed updates is stream multiplexing. KCP splits updates for different entities into different, independent, ordered streams where possible. While this feature is also available for both the RakNet and TCP network stacks, the runtime and memory overhead associated with each KCP stream is small (whereas each TCP stream requires a new TCP connection) and you can have as many streams as you want (RakNet is fixed at 32). Therefore, you can choose the multiplex level of your worker’s connection to handle the number of entities you expect your client-workers to visualize or interact with.
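The benefit of multiplexing is the absence of head-of-line blocking between entities. As a toy illustration (this is not the SpatialOS stream-assignment algorithm, just a sketch of the idea):

```python
from collections import defaultdict

def group_updates_by_stream(entity_ids, multiplex_level):
    """Toy stream assignment: map each entity onto one of
    `multiplex_level` independent ordered streams. A lost packet then
    only delays updates for the entities sharing that one stream,
    instead of stalling every update behind it (as a single ordered
    TCP stream would)."""
    streams = defaultdict(list)
    for eid in entity_ids:
        streams[eid % multiplex_level].append(eid)
    return dict(streams)
```

With a multiplex level of 4, entities 1, 5 and 9 would share one stream while updates for entities 2, 3 and 4 flow independently; raising the multiplex level spreads entities over more streams at a small per-stream cost.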

Erasure coding layer

The erasure coding layer is only available as part of the KCP network stack and provides a further boost to fast, reliable transport.

But what is erasure coding? In the context of reliable transport, it is the technique of generating additional, redundant “recovery” packets to send along with the “original” packets which contain your data, in such a way that lost packets can be reconstructed by the receiver without requiring the sender to retransmit the originals. One of the simplest forms of erasure coding involves sending duplicates of each packet such that, if any single duplicate makes it to the destination, no data has to be resent. We have integrated a configurable form of erasure coding, known as Maximum Distance Separable (MDS) codes, into the KCP network stack. See the documentation for how to configure it for your worker.
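The simplest non-trivial case is a single XOR parity packet: one recovery packet per group of originals, from which any one lost original can be rebuilt. Real MDS codes such as Reed-Solomon generalise this to multiple recovery packets per group, but the XOR case shows the principle:

```python
def xor_bytes(a, b):
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def make_recovery(originals):
    """Build a single recovery packet as the XOR of all originals
    (assumed equal length). Because XOR is its own inverse, XOR-ing
    the recovery packet with all-but-one original yields the
    missing one."""
    recovery = originals[0]
    for pkt in originals[1:]:
        recovery = xor_bytes(recovery, pkt)
    return recovery

def recover(received, recovery):
    """Reconstruct the single missing original from the packets that
    did arrive, with no retransmission round trip needed."""
    missing = recovery
    for pkt in received:
        missing = xor_bytes(missing, pkt)
    return missing
```

If packets `b"aaaa"`, `b"bbbb"`, `b"cccc"` are sent with their XOR as a fourth packet, and `b"bbbb"` is lost, the receiver recovers it locally from the other three.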

Encryption layer

Another layer where the KCP network stack offers unique functionality is the encryption layer. In the first version of the KCP network stack, any data sent over the network is encrypted using DTLS. DTLS stands for Datagram Transport Layer Security and works in much the same way as TLS (the TCP equivalent). It prevents attackers from eavesdropping on data that your worker sends and receives. It also prevents attackers from disrupting other game clients’ sessions by hijacking them or forging messages from those clients.

These techniques can deter, and go some way towards preventing, cheating, but it should be noted that network-layer security is only half the battle. You should combine it with game feature implementation choices such as server-side authority, for features such as hit detection, in order to best defend against cheating.
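Server-side authority simply means the server re-checks claims against its own authoritative state rather than trusting the client. A deliberately minimal sketch (the range check and function name are hypothetical, not a SpatialOS API):

```python
import math

def server_accepts_hit(shooter_pos, target_pos, weapon_range):
    """Server-side authority sketch: instead of trusting a client's
    'I hit the target' message, the server validates the claim using
    its own view of the world; here, a simple plausibility check that
    the target was actually within weapon range."""
    return math.dist(shooter_pos, target_pos) <= weapon_range
```

Even with encryption, a compromised client can send well-formed but dishonest messages; checks like this are what actually reject them.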

Currently, DTLS is enabled by default for KCP, but users will be able to opt out in the future since encryption incurs an additional bandwidth and CPU overhead.

Performance analysis

Now that you know how the KCP network stack tries to deliver data packets quickly on unreliable networks, let’s take a look at the results of some experiments we performed within Improbable to compare its performance with RakNet and TCP.

Each experiment is set up as follows:

  • A “sender” worker runs on a Linux client machine in Improbable’s office in London. It sends one component update on a fixed number of entities, each tick, to a deployment running in an EU cluster. Each component update contains 50 bytes of data (excluding overhead from lower level protocols), including a timestamp recorded on the client machine. The tick rate is 60Hz (i.e. 60 ticks per second).
  • A “receiver” worker runs on the same client machine, reads the update when it arrives from the bridge, and calculates a “round-trip time”, in milliseconds, between when the update was sent from the “sender” worker, and when it was received on the “receiver” worker. Each round-trip time recorded is a single sample.
  • A Linux program for simulating packet loss when sending on a specific network interface is run, giving each packet a 0.5% chance of being dropped.
  • Each experiment tests a single network stack at a time and collects samples for 5 minutes, resulting in 100,000s of samples in each data set.
  • For TCP and RakNet, the default parameters provided by the C# worker API are used. KCP uses the following parameters (and defaults for any parameters not listed):
FastRetransmission: true
EarlyRetransmission: true
NonConcessionalFlowControl: true
MultiplexLevel: 32
UpdateIntervalMillis: 10
MinRtoMillis: 10
EnableErasureCodec: true
ErasureCodec.OriginalPacketCount: 10
ErasureCodec.RecoveryPacketCount: 3
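The two erasure codec parameters above determine the bandwidth cost of the redundancy. A quick back-of-the-envelope calculation (the helper function is illustrative, not part of the worker API):

```python
def erasure_overhead(original_count, recovery_count):
    """Extra packets sent per original packet, as a fraction, for a
    codec that emits `recovery_count` recovery packets per group of
    `original_count` originals."""
    return recovery_count / original_count

# With OriginalPacketCount=10 and RecoveryPacketCount=3, the codec
# sends 3 extra packets per 10 originals: 30% packet overhead, in
# exchange for surviving up to 3 lost packets in each group of 13
# without any retransmission (assuming an MDS code).
```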

Note that this scenario is atypical in the sense that the client is both sending and receiving the updates (rather than communicating with a server-worker), but this scenario allows us to calculate the timestamps on the same machine and incorporate both the performance of sending updates to the bridge as well as receiving updates from the bridge in the same results.

The following graphs show the percentiles of packet round-trip time (in milliseconds) for the delivery of packets from a “sender” client-worker to a “receiver” client-worker via the bridge.
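For readers reproducing such experiments, a percentile over a set of round-trip-time samples can be computed straightforwardly. This sketch uses the nearest-rank convention; plotting libraries may interpolate differently:

```python
import math

def percentile(samples, p):
    """Nearest-rank p-th percentile of a list of samples: the value
    below which (at least) p percent of the samples fall."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# For RTT samples in milliseconds, e.g.
# rtts = [12, 15, 14, 90, 13, 16, 14, 15, 13, 250]
# percentile(rtts, 50) gives the median, while high percentiles such
# as percentile(rtts, 99.8) expose the worst-case spikes that matter
# most for perceived lag.
```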

[Graph: round-trip time percentiles for 25 entities]

[Graph: round-trip time percentiles for 50 entities]

Both graphs provide evidence that, on unreliable networks, KCP is superior to both TCP and RakNet for minimizing worst-case latency. Specifically, for 25 entities, the longest round-trip time out of all 436,271 packets was 51ms for KCP, compared to 114ms for RakNet. Relative to the average packet, this corresponds to a worst-case delay of ~30ms (about 2 frames at 60Hz) for KCP compared to ~90ms (5-6 frames) for RakNet. This gives you an idea of how noticeable the resulting lag might be to the human eye.
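The frame conversion above is simple arithmetic worth making explicit, since frames of lag are often more intuitive than milliseconds:

```python
def delay_in_frames(delay_ms, tick_rate_hz=60):
    """Convert an extra delay in milliseconds into frames of lag at a
    given tick rate. At 60 Hz each frame lasts 1000/60 ~= 16.7 ms."""
    return delay_ms / (1000 / tick_rate_hz)

# ~30 ms of worst-case extra delay is about 1.8 frames at 60 Hz,
# while ~90 ms is about 5.4 frames: roughly the "2 frames" vs
# "5-6 frames" figures quoted above.
```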

For 50 entities, the effect is even more pronounced. The 99.8th percentile stands at 44ms for KCP compared to 243ms for RakNet, and the maximum round-trip time was 83ms for KCP compared to 327ms for RakNet. TCP performs so poorly under 0.5% simulated packet loss that it is not even worth comparing.

It is important to note that these results are sampled under a combination of real and (probabilistically) simulated network conditions, both of which can vary between each experiment. The scenario of 25-50 updates per second on 50 entities may be roughly representative of your game, or completely different. We encourage you to design your own test scenarios and come to your own conclusions about what is best for your game. Note that you can use metrics to help inform your decisions during optimization and monitor performance during testing and production.

Get started now

The new APIs for KCP are available to use in all supported languages in version 13.6.0 of the worker SDK. Once you have updated your SpatialOS project to use this version, you can consult the API reference to integrate KCP into your project and use the documentation for help with configuring KCP for your worker.