
Information Security at Improbable

30 June 2019

Core Security is responsible for all technical aspects of information security at Improbable. Peter Mounce and Sherif Ragab take us through what that means in practice.

Peter Mounce is a software engineer at Improbable. 

At Improbable, we have a team of around 350 people and massive ambitions to empower our customers to create the worlds that they dream about. Sometimes, dreams don’t include, or gloss over, aspects of real life. In real life, there are people that seek to exploit others for money, revenge, greed, self-defense, opinion, power and boredom. Reads like a crime novel, right? Very dystopian.

This is why we care about security. It doesn’t feel good to be exploited. It doesn’t make good economic sense to run a platform that leaks our customers’ or their players’ data - why would you trust a world built on our platform if we failed to demonstrate that we take good care of our customers and their players? And it isn’t fun to play a game where some people are cheating.

So, inside Engineering Foundation, closely aligned with Engineering Velocity, sits our Core Security team. It's responsible for all technical aspects of information security, spanning the entire spectrum from defensive security (access control, system hardening, authentication) to responsive security (finding and fixing bugs, investigating incidents, log analysis, and so on).

Security principles

Core Security has a set of principles designed to multiply our company’s ability to care about and maintain security:

  • Make the right thing to do also be the easiest thing. People are human and follow the path of least resistance; make that path incrementally the most secure. Create learning opportunities for our people to do the same. Insert security features and libraries at the surfaces and integration-seams of systems for greater impact.
  • Make it hard to aim at your foot. Teach and encourage prudence and safe-by-default. If something is also a loaded gun, don’t call it “a hammer” when talking about how to use it. Two concrete examples of how we do this are:
    •         Enforce observability standards. If our systems are observable, and we can make computers listen to them, we multiply our effectiveness at spotting problems that we’ve seen before. This means we can concentrate on exploring for new problems.
    •         Enforce least-privilege. Have the access you need at the time and no more. If you need more access, the process for getting it is auditable, so we can understand why things happened.
  • Learn from mistakes. People are human and make mistakes. Celebrate those as opportunities to do better rather than apportioning blame. Uphold our culture of continuous improvement by analysing what went wrong and acting to make it harder to repeat. We’re not perfect, but we can get closer.

Now Sherif Ragab is going to walk us through what the day-to-day of Core Security looks like.

Coming to Improbable

Sherif Ragab is a security engineer at Improbable who has worked as a security analyst, penetration tester and software engineer for various technology companies. 

In my experience, employees at tech-companies tend to exhibit a love of technology and computers going well beyond the scope of their work. They influence the culture to make it open to new and creative ideas, and are enthusiastic about trying new and audacious solutions. I've found this also comes with a willingness to move fast and fail early, as well as a strong commitment to high standards of technical delivery.

A bit of everything

My previous role was in Security at Google. Unsurprisingly, the nature of work in a startup is quite different from a tech giant - after all, at Google, just the responsive part of the corporate security team was comfortably larger than the entirety of Improbable.

One consequence of this was an almost diametrically opposed degree of specialization. Whereas in a considerably larger and more mature security ecosystem the career path involved becoming an expert in a very specific sub-subsystem, doing security at Improbable is much broader in scope. Here, I'm responsible for everything under the "security" umbrella: I do everything from playing around with YubiKeys to setting up PKIs, to thinking about access control, monitoring and incident response.

To give you a better idea of what this role consists of, I'm going to walk us through two examples.

Security monitoring

Prior to my time at Improbable, my role at Google was focused on log analysis, intrusion detection and incident response. Over the years, we had built a sophisticated log-ingestion and analysis pipeline that we used to hunt for "badness".

Since most of that system was built on Google's internal tooling - easy to integrate with their logging infrastructure, but unusable anywhere else - I couldn’t bring it with me. Instead, I started working on replicating those capabilities at Improbable.

My first solution was StreamAlert, an open-source log-processing pipeline originally developed by Airbnb and built on AWS Lambda. Since we were already heavy users of cloud-based services, it was a good fit for our internal security monitoring goals. It gave us:

  • A centralized system for all security logs.
  • A platform for writing alerting rules based on incoming logs.
  • A system to provide historical views into logs during investigations.
  • An ability to integrate with a wide range of technologies and platforms, such as Google Apps, GCP, AWS and others.
  • No concerns about scalability or availability.
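
To give a flavour of the model this enables - rules registered against a stream of incoming log records - here's a minimal, hypothetical rules engine in Python. The decorator, event shape and field names are illustrative, not StreamAlert's actual API:

```python
# Conceptual sketch of a "rules over incoming logs" pipeline.
# All names and the record schema here are invented for illustration.
RULES = []

def rule(func):
    """Register a function that inspects one log record."""
    RULES.append(func)
    return func

@rule
def console_login_without_mfa(record):
    # Illustrative rule: flag console logins that skipped MFA.
    return (record.get("event") == "ConsoleLogin"
            and record.get("mfa_used") is False)

def process(records):
    """Run every registered rule over each record; return (rule, record) alerts."""
    return [(r.__name__, rec) for rec in records for r in RULES if r(rec)]
```

In StreamAlert itself, the rules run inside AWS Lambda against classified log records, and matches are forwarded to configured alert outputs.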

Understanding, collecting and processing logs

As a startup, we build much of our infrastructure on third-party services and open source tools, rather than engineering the stack in-house. This means integrating disparate systems into centralized solutions can be more involved than in an ecosystem where everything is designed from the ground up to be interoperable.

A logging solution for security monitoring would face the same challenges while being indispensable. Suspicious behaviour from a cyber attack may only become apparent afterwards, when the attack's footprint is analysed across a wider range of log sources. An appropriate logging system would have to:

  • Have all relevant logs stored in the same place.
  • Make log attributes (e.g. "username", "source-ip", etc) comparable across different log types to enable automated correlation detection.
  • Provide an effective platform for investigations: in response to concrete incidents or for proactive hunting.
  • Automate alerting/paging based on programmable criteria.
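
Making attributes comparable across log types means normalising each source's fields into a common schema before correlation. A hypothetical sketch, with made-up field mappings:

```python
# Hypothetical sketch: rename per-source field names into a common schema
# so attributes like "username" and "source_ip" can be joined across log
# types. The mappings and source names are invented for illustration.
FIELD_MAP = {
    "gsuite": {"actor_email": "username", "ip_address": "source_ip"},
    "aws_cloudtrail": {"user_name": "username", "sourceIPAddress": "source_ip"},
}

def normalise(source, record):
    """Rename known fields; keep the original record under 'raw'."""
    out = {"log_source": source, "raw": record}
    for src_field, common_field in FIELD_MAP.get(source, {}).items():
        if src_field in record:
            out[common_field] = record[src_field]
    return out

def correlate(events, key):
    """Group normalised events by a shared attribute, e.g. 'source_ip'."""
    groups = {}
    for event in events:
        if key in event:
            groups.setdefault(event[key], []).append(event)
    return groups
```

With this in place, a single query can, for example, surface every event seen from one source IP regardless of which system logged it.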

Writing meaningful rules

Initial work focused on getting a good body of security-relevant logs fed into and processed by the system, and setting up integration with our alerting pipeline (OpsGenie in our case). When we had gathered a large pile of logs, the next challenge was to help find the needle in the haystack - that is, write meaningful rules.

In this context, "meaningful" means that the rules fire on events security engineers classify as "interesting" (i.e. indicative of a compromise or an actionable issue), while not being too noisy (i.e. the false-positive rate is manageable).

Since we are early in the log pipeline's development life cycle, this has been an ongoing effort: we started with small, straightforward rules that depend on a single log event. An example would be events which grant administrative privileges to a system - something which should be rare, and for which a relatively small, static whitelist can be maintained.
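
A hypothetical sketch of such a single-event rule, with an invented event shape and whitelist:

```python
# Sketch of a single-event rule: alert when administrative privileges are
# granted to anyone outside a small, static whitelist. The event fields
# and addresses are hypothetical.
ADMIN_WHITELIST = {"alice@example.com", "bob@example.com"}

def admin_grant_outside_whitelist(event):
    """Fire on admin-grant events for non-whitelisted principals."""
    return (event.get("action") == "GRANT_ADMIN"
            and event.get("target_user") not in ADMIN_WHITELIST)
```

Because legitimate admin grants are rare, a rule like this stays quiet in normal operation and pages loudly on exactly the events worth investigating.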

Bug bounties

At an aggressively growing tech startup, one of the major challenges we face in the security team is keeping up with the rapid pace of engineering across the platform. Having worked extensively on mitigating both internal and external bug reports, I knew how often exploitable bugs creep into large, complex code-bases and how finding them is essentially a numbers game.

With only limited resources to focus on information security, we couldn’t inspect our systems with the depth and breadth needed to weed out exploitable vulnerabilities to a comfortable level.

We therefore decided to use a gamified bug-bounty platform - a controlled way of opening up selected parts of our infrastructure to external hackers to find bugs and vulnerabilities. We decided to launch our program on HackerOne.

Eliminating low-hanging fruit

To avoid burning up our bounty pool with low-hanging fruit that is easily discoverable, we went through a phase of "pre-sweeping", during which we:

  • Set up and ran Nessus, working through a triage-mitigation cycle.
  • Automated the enumeration of our key assets with API clients and scripts, consolidating the metadata into a single queryable knowledge-base to assist security testing. This included sources like:
    •         DNS zone files from our DNS provider.
    •         Kubernetes master APIs.
    •         Protocol buffer service definitions.
    •         Client-server relationships extracted from source code.
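
As an illustration of the consolidation step, here's a minimal sketch that loads (heavily simplified) DNS zone-file records into a queryable SQLite table; the other sources would feed the same store. The zone-file format handled is deliberately minimal, and the table layout is an assumption:

```python
import re
import sqlite3

# Sketch: consolidate asset metadata into one queryable knowledge-base.
# Handles only simple "name TTL IN TYPE value" zone-file lines.
RECORD_RE = re.compile(r"^(\S+)\s+\d+\s+IN\s+(A|CNAME)\s+(\S+)")

def load_zone(conn, zone_text):
    """Parse zone-file text and insert matching records into 'assets'."""
    conn.execute("CREATE TABLE IF NOT EXISTS assets (name TEXT, type TEXT, value TEXT)")
    for line in zone_text.splitlines():
        match = RECORD_RE.match(line.strip())
        if match:
            conn.execute("INSERT INTO assets VALUES (?, ?, ?)", match.groups())
```

Once every source lands in the same table, "what do we actually expose?" becomes a SQL query rather than an archaeology project.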

We also engaged an external company to conduct a penetration test on parts of the infrastructure we felt particularly nervous about, like authentication and API services.

Drafting the scope and program policy

One of the challenges we have faced setting up the program was to make our complex ecosystem of services and tooling more "testable", and thus more attractive for researchers.

Before we looked seriously at the platform from a security-testing point of view, most of the engineering work had been focused on the development work-flow and on implementing new services and features.

Our infrastructure of gRPC microservices, in particular, was unusual enough to make it challenging to come up with a framework that would enable a tester to send arbitrary requests to any endpoint and any service running on that endpoint. To begin with, there was no single machine-readable authoritative index of endpoints, services and gRPC functions.

To compile this data in a repeatable way, we hacked together a set of scripts to:

  • Query the Kubernetes Master API across clusters for running services as well as external load-balancer configurations.
  • Parse the Zone File from our DNS provider.
  • Parse protocol buffers to extract service definitions, as well as corresponding request/reply data-structures.
  • Extract client-server relationships from source code by taking advantage of the consistent nature of the code generated from protocol buffers.

Combining data from these sources, we were able to build an all-encompassing map of exposed endpoints, the gRPC services and functions behind them, and their request/reply structures.
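
The protobuf-parsing step can be sketched with a couple of regular expressions that pull service and rpc definitions, with their request/reply types, out of .proto source. A real .proto grammar needs a proper parser; this covers only the simple common case:

```python
import re

# Illustrative sketch: extract gRPC service definitions from .proto text.
# Assumes no nested braces inside a service block.
SERVICE_RE = re.compile(r"service\s+(\w+)\s*\{(.*?)\}", re.DOTALL)
RPC_RE = re.compile(r"rpc\s+(\w+)\s*\(\s*(\w+)\s*\)\s*returns\s*\(\s*(\w+)\s*\)")

def service_map(proto_text):
    """Map service name -> list of (rpc name, request type, reply type)."""
    services = {}
    for name, body in SERVICE_RE.findall(proto_text):
        services[name] = RPC_RE.findall(body)
    return services
```

Joined against the Kubernetes and DNS data, output like this tells a tester exactly which functions are callable on which exposed endpoint.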

Handling incoming bugs

The most interesting part of working on the bug-bounty program has been working on the actual incoming reports of vulnerabilities. Our process is:

  • Carefully read the report and verify the bug.
  • Identify the affected services and drill down on the corresponding code until the issue can be localized precisely.
  • Work on a fix. This ranges from a one-line change to a specific part of the code, to planning, delegating and following up on appropriate mitigation work.
  • Follow up on the fix until it's rolled out to all environments and we can verify the original bug is rectified.
  • Close reports and tickets, document relevant items and pay out bounties. For more serious vulnerabilities, we carry out a postmortem to document the cause, timeline and any mitigation work for the incident.

Life on the Core Security team at a technology startup

Those are just two of the many challenges I’ve tackled at Improbable. One of the advantages of a start-up in a wholly new arena is that you’re tasked with a diverse set of impactful problems that may never have been solved before - giving you the freedom to pioneer best practices and invent whole new solutions. I’m sure by this time next year, I’ll have even more war stories to tell you...

If that sounds like something you would like to care about too, we’d love to work with you - check out our engineering vacancies at the link below.