Bartek Plotka is a Software Engineer at Improbable, and he’s passionate about emerging technologies and distributed systems. He previously worked at Intel, has contributed to Mesos, and is a huge Golang fan.
At Improbable, we run dozens of services in multiple regions around the world, delicately orchestrated and scheduled using Kubernetes as our core system. The truth is that, these days, even a single Kubernetes cluster is really powerful – the latest Kubernetes scalability benchmark demonstrated 5,000 nodes and 150,000 pods in one cluster.
This is really awesome! However, at Improbable, we decided to arrange our services in dozens of separate Kubernetes clusters for the following reasons:
- High availability (HA) across zones.
- Fault isolation, especially against operator errors.
- Avoiding provider lock-in.
- Service and data locality.
- Test clusters for canary, especially when you manage your own Kubernetes.
The simplest arrangement would be to just align these clusters in a single flat space as simple HA replicas. However, they are often arranged in a hierarchical federation which provides the above properties and allows partial network isolation, easier management, and meta-monitoring. At Improbable, we use the following structure:
In this example, the leaf clusters, called deployment clusters, are treated as HA replicas within a region. Services in a deployment cluster are therefore not aware of any other cluster and can only connect to resources inside their own cluster. As a second layer, we have hub clusters, which can access all deployment clusters’ services but are not aware of the other hub clusters. Hub clusters are designed to serve traffic to the leaf clusters in the same region, but they are also able to take traffic from hub clusters in different regions if required. At the top, we have a global cluster, ensuring a global view and enabling meta-monitoring. It has the ability to connect to all hub and deployment resources.
Obviously, decoupling clusters brings new problems to the table and, after over a year of managing multiple Kubernetes clusters at Improbable, we know which one frustrated us the most – the setup and maintenance of cross-cluster communication. In other words, a way to allow two Kubernetes pods, each in a separate cluster, to securely communicate (a service-to-service case), as well as allowing a user to securely connect to the cluster (a user-to-service case). For simplicity, we can treat these two types of communication as cross-cluster traffic.
We can divide this functionality into a few required elements. First of all, you need cross-cluster networking to be able to transfer the data. This usually means a combination of P2P VPN configurations and complex sets of routing rules, because each cluster needs to know every other cluster’s three network ranges (typically the node, pod and service CIDRs).
On top of that, you need federated service discovery. This ensures that you can find the correct address of a service in a remote cluster. The most common approach we have seen uses DNS stub zones, supported by CoreDNS or the latest kube-DNS. Alternatively, Kubernetes Federation has emerged, which supports cross-cluster discovery.
Undoubtedly, the VPN with DNS stub zones approach gives both the operator and developer lots of flexibility, mainly because it feels like there are no cluster borders. You are connected to the “secure” VPN, and can resolve and connect to services from all federated clusters. It seems perfect – what could go wrong?
Unfortunately, the long-term results of using a VPN with DNS stub zones are unexpected:
- Difficult setup and maintenance of IPSec bridges and complex routes.
- Extremely hard to debug VPN connectivity issues.
- Lack of proper tracing for user traffic inside the VPN network, and limited auditing.
- No fine-grained control over resources.
- Immature DNS stub domains support.
Apart from the above, at Improbable we hit numerous additional caveats caused by the subtle interplay of all the components sitting together, such as IPSec tunnel MTU limits breaking certain UDP DNS packets.
Having seen all of these issues, we started looking for alternatives. Kubernetes Federation is no better – it is also a single point of failure in terms of operator error, as a single configuration mistake can potentially take down the whole control plane. Istio has the potential to solve our problems, but the cross-cluster case is far away on its roadmap. At the same time, none of the existing service meshes or network overlays provide an easy fix.
All of this is why the kEdge project was started. Born from the pain of setting up communication between Kubernetes clusters, kEdge is a reverse proxy solution that uses a simple mapping to route requests to a destination (usually a Kubernetes service). When compared to VPNs and DNS stub zones, kEdge provides us with the following:
- Reduced maintenance cost
- Improved debuggability, allowing us to trace each individual request
- Fine-grained access control
- No dependency on DNS stub zones (with extra configuration, no dependency on DNS at all!)
kEdge itself is just a small binary that serves as a stateless gRPC and HTTP proxy mechanism. It is meant to be used as a reverse proxy on the edge of each isolated cluster. It can be used for service-to-service communication and user-to-service traffic.
A high-level overview of kEdge can be seen in the following diagram. As you can see, kEdge itself consists of just a few components – an Auth layer, a Mapper and a Reverse Proxy.
The Auth layer is responsible for authenticating and authorising the end-user’s request. It can be configured to use client TLS certificates or the OpenID Connect flow. The latter is extremely useful for quickly and reliably authorising a request that came from an authorised human user (stay tuned for another Improbable blog post on that!). We also plan to extend the auth layer to support different permissions per route. This will allow more fine-grained control over access to certain services.
Every authorised request goes to the Mapper. The Mapper is a simple component that matches an incoming request to a particular “backend”. The backend is nothing more than a short description of how to forward the request to its target service. The following snippet is an example configuration proto explaining what data is kept for HTTP backends:
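A simplified sketch of such a backend definition is shown below. The field names here are illustrative, not the actual kEdge schema (see the kEdge repository for the real proto):

```protobuf
// Illustrative sketch only -- not the real kEdge configuration proto.
message HttpBackend {
  // Name used by routes to reference this backend.
  string name = 1;

  // How to resolve the backend's addresses: a plain DNS name, or a
  // Kubernetes service looked up via the endpoints API (the custom
  // k8sresolver described later in this post).
  oneof resolver {
    string dns_name = 2;
    string k8s_service = 3;
  }

  // Port on the resolved endpoints to forward the request to.
  uint32 port = 4;

  // Whether to secure in-cluster traffic to this backend with mTLS.
  bool require_mtls = 5;
}
```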
The mapper is able to match a route to a backend based on request host, port, path, service name (in case of gRPC) or even headers.
The request then goes to the Reverse Proxy component, which is responsible for resolving the IP address(es), load balancing across the chosen backend’s endpoints, and finally sending the request. The central piece of the backend configuration used here is the resolver. The proxy also secures the request within the cluster using mTLS, if configured.
We’ve also added integrations to make life with Kubernetes easier! For instance, one of our integrations is the Dynamic Routing Discovery. You can imagine that having all the routes and backend details configured manually is really tedious and prone to errors.
This is where our Dynamic Routing Discovery comes in handy. It is an optional component that deduces route -> backend pairs from the Kubernetes Service itself! All you need to do is specify a single label indicating that the service is meant to be exposed via kEdge. There are also numerous options to tweak the resulting backend domains, names, and so on.
Another integration we’ve included is an enhancement for the resolver itself. For example, what if you don’t trust your DNS setup? We can avoid using it with our custom k8sresolver. Our resolver is able to resolve a common kube-DNS-like domain (<service>.<namespace>(|.<any suffix>):<port|port name>) using the Kubernetes endpoints API to substitute the IPs and ports. In fact, you could even remove kube-DNS from your cluster entirely, as it is not used in this approach.
Multiple Cluster Scenario
To set up kEdge for a single cluster, simply run a few replicas behind an L3 load balancer. From the kEdge client’s perspective, it does not matter whether the client is a pod or a user’s dev machine, as it is quite straightforward to set up communication when you have only a single cluster. As a result, you can use any client that can proxy the desired request (with proper auth and/or certificates) through the configured domain (e.g. using the HTTP_PROXY environment variable). In the future, we might want to add HTTP CONNECT tunnelling support for secure TLS connections – once Go 1.10 is released, it will add client support for it.
While it is reasonably simple to pass a single domain as a proxy host, it is not so straightforward to route requests when you have multiple clusters available via kEdge. As before, we need to run a reasonable number of replicas on each cluster’s edge within some certified domain. Let’s say cluster-1.proxy.example.com for the first cluster, cluster-2.proxy.example.com for the second one, cluster-N.proxy.example.com for cluster N, etc. This exposes the clusters’ existence over public DNS, but that is a tradeoff we accept for the lower operating cost.
With this approach we need some way to forward a request to the desired kEdge domain with proper authentication and certificates. We can use the following approaches:
- Native “dialer”
- Local proxy
For both options, it is very convenient to establish an internal domain scheme that is recognised as addressing a service behind kEdge routing. For instance, we can define nginx.cluster-1.internal.example.com for a Kubernetes service nginx running in cluster-1.
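With such a convention, working out which kEdge should handle a request becomes a pure function of the host name. A minimal Go sketch, using the hypothetical domains from this post (the suffix and cluster naming are this example’s assumptions, not anything kEdge mandates):

```go
package main

import (
	"fmt"
	"strings"
)

const internalSuffix = ".internal.example.com"

// kedgeFor maps an internal service domain such as
// nginx.cluster-1.internal.example.com to the kEdge endpoint of the
// cluster that hosts it: cluster-1.proxy.example.com:443.
func kedgeFor(host string) (string, error) {
	if !strings.HasSuffix(host, internalSuffix) {
		return "", fmt.Errorf("%q is not an internal domain", host)
	}
	labels := strings.Split(strings.TrimSuffix(host, internalSuffix), ".")
	if len(labels) != 2 {
		return "", fmt.Errorf("expected <service>.<cluster>%s, got %q", internalSuffix, host)
	}
	cluster := labels[1]
	return cluster + ".proxy.example.com:443", nil
}

func main() {
	edge, _ := kedgeFor("nginx.cluster-1.internal.example.com")
	fmt.Println(edge) // cluster-1.proxy.example.com:443
}
```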
The simplest and most appropriate solution for service-to-service communication is to have a native client that has a custom dialer implementation which maps the pod internal domain to the proper cluster. This approach can be seen in the following diagram:
In this case, the kEdge dialer (client) within Pod A knows the kEdge mapping logic and so is able to deduce which cluster each request should go to, allowing a secure connection to Pod B. The kEdge project includes two Golang implementations, for gRPC and HTTP. The obvious limitations of this approach are:
- You need as many implementations as languages you are using.
- Instead of a single source of truth for mapping configuration and authentication, you have to configure every single service client that reaches outside its cluster.
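Despite those limitations, the core of a native dialer is small. The following is a minimal Go sketch wired into http.Transport, assuming the hypothetical *.internal.example.com convention from this post; the real kEdge Go clients additionally handle auth and gRPC:

```go
package main

import (
	"context"
	"crypto/tls"
	"net"
	"net/http"
	"strings"
)

// newKEdgeTransport returns an http.Transport whose dialer sends any
// *.internal.example.com connection to that cluster's kEdge endpoint
// instead of resolving the internal name directly. A sketch of the
// native-dialer idea only, not kEdge's actual client code.
func newKEdgeTransport() *http.Transport {
	return &http.Transport{
		DialContext: func(ctx context.Context, network, addr string) (net.Conn, error) {
			host, _, err := net.SplitHostPort(addr)
			if err != nil {
				host = addr
			}
			if strings.HasSuffix(host, ".internal.example.com") {
				// nginx.cluster-1.internal.example.com -> cluster-1.proxy.example.com
				labels := strings.SplitN(strings.TrimSuffix(host, ".internal.example.com"), ".", 2)
				edge := labels[len(labels)-1] + ".proxy.example.com:443"
				// kEdge terminates TLS at the cluster edge; client
				// certificates or OIDC tokens would be configured here.
				return tls.Dial(network, edge, &tls.Config{})
			}
			// Everything else dials as normal.
			return (&net.Dialer{}).DialContext(ctx, network, addr)
		},
	}
}

func main() {
	_ = &http.Client{Transport: newKEdgeTransport()}
}
```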
Another solution we can use is to introduce a local proxy daemon that has similar logic to the native client but is not bound to each client – instead, it is running as a separate binary on the same machine as the client. This solves the limitations of native handlers by proxying all client requests to the same local endpoint that is able to forward them to the proper kEdge-enabled cluster if needed.
Our kEdge project includes one local proxy implementation called “Winch”. You can easily configure this stateless local daemon to understand which requests should go to which kEdge and how to authenticate them. The diagram below shows an example setup including Winch and kEdge for a user-to-service case. Obviously, it can be mapped to a service-to-service case accordingly.
The example uses OpenID Connect for authentication, but it can be switched to mTLS as well. During this flow, a client proxies through the local Winch using plain HTTP. This is safe since we are proxying traffic over the localhost interface only. (When HTTP CONNECT support in Go is mature, it will also be possible to switch to TLS.) Winch then establishes a TLS connection with kEdge and authenticates it. After kEdge authorises the request, it goes to the correct pod.
Winch is also able to “inject” authentication for backend services. This is very useful for certain clients that do not allow you to send auth tokens through plain HTTP headers (like kubectl), and it can be extended to cover any additional client auth needs through simple configuration. You can also specify a particular kube config’s “user” as an auth method, which avoids duplicated configuration, because you can keep all your auth options in your kube config.
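To make this concrete, a purely illustrative configuration sketch might map internal domains to kEdge endpoints and attach an auth source (the field names here are invented for this example and are not Winch’s actual schema):

```yaml
# Illustrative only -- not the real Winch configuration format.
routes:
  # Anything under cluster-1's internal domain goes via cluster-1's kEdge.
  - host_pattern: "*.cluster-1.internal.example.com"
    kedge_url: "https://cluster-1.proxy.example.com:443"
    auth: corp-oidc
auth_sources:
  # OIDC flow against the company identity provider; Winch injects the
  # resulting token into every proxied request.
  - name: corp-oidc
    oidc:
      provider: "https://accounts.example.com"
      client_id: "winch"
```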
All of the above makes the local proxy (Winch) a perfect choice for user-to-service communication:
- All clients can be language agnostic, you just proxy the TCP/IP request.
- You have a single source of truth for mapping configuration and auth.
After using kEdge at Improbable in production for a couple of months, we can confirm that it improves our cross-cluster communication. It has made our lives significantly easier: no longer do we have to spend time configuring and maintaining flaky VPN services between our clusters. Additionally, we have much more control and oversight: we’re now able to trace individual requests and block other requests entirely based on per-cluster authorisation. Finally, we’re excited to use kEdge to remove our dependency on advanced cross-cluster service discovery such as DNS stub zones.
Would you like to contribute?
There are certain items and improvements that we would still like to see in kEdge. You can find the full wishlist of features and enhancements here.
If you’re interested in helping out, feel free to visit https://github.com/improbable-eng/kEdge and help us improve the kEdge experience! All contributions via Pull Requests and feedback via GitHub Issues are very welcome. The kEdge project is released under the Apache License 2.0.
Are you an engineer or interested in working in tech? Good, we’re hiring!