A decade ago, Netflix created a concept called chaos engineering to test the resilience of its systems as the streaming media company moved them to the cloud.
Today, two proponents of the concept tout how chaos engineering can be used in cybersecurity.
In 2011, Netflix built Chaos Monkey, a chaos engineering tool that deliberately injects failures into the company's own infrastructure by randomly terminating production instances. The goal of the tool is to reveal failure points to help the company improve resilience and maintain uptime for its users. It has since helped the company lessen the impact of failures or avoid them completely.
So, what does that mean for security? Kelly Shortridge, senior principal at Fastly, and Aaron Rinehart, co-founder and CTO at Verica, are ready to explain.
“The problem is that there’s a lack of recalibration for a continuously adapting system,” Rinehart said. “We don’t know we need to adapt because the software itself doesn’t inform security about an issue until security knows there’s a problem. That is not a good way to approach engineering security.”
They began raising awareness of security chaos engineering at conferences such as RSA and Black Hat. Then, in 2020, they published a book on it.
Here, Rinehart and Shortridge explain how they became interested in security chaos engineering, what companies can do to implement security chaos engineering testing and more.
Editor’s note: The following interview was edited for clarity and length.
What got you interested in adapting chaos engineering for security?
Kelly Shortridge: It started with reading a book about earthquakes and learning about resilience. I went down a rabbit hole and started looking at how resilience applied to computer systems, which is normally in the realm of distributed systems. For a few years, I felt like I was speaking into the void. Then, I noticed Aaron [Rinehart] talking about chaos engineering applied to information security. Aaron reached out to me and said, ‘I was approached by O’Reilly to talk about security chaos engineering — do you want to join forces?’ Conveniently, I had already started drafting a book on the topic.
Aaron Rinehart: We didn’t know each other. We worked on security chaos engineering from different angles. I was the chief security architect at UnitedHealth Group at the time. I started experimenting with chaos engineering and felt it made sense from a security perspective. Security suffers from the same problems you’re trying to solve with chaos engineering. You should proactively verify a system does what it’s supposed to do. What’s hard with modern systems is there are so many people with their hands in the pot. It’s hard to know everything and be confident about how it’s running.
Shortridge: I came into the industry from an investment banking background and economics before that. For a few years, I was talking about infosec problems from a behavioral economics lens and from the use of public cognitive biases and so-called irrationalities that plague some of our thinking and result in inefficient outcomes. I wondered how we could make a systemic improvement to eliminate these inefficiencies. That’s when I explored resilience engineering. Aaron came at security chaos engineering from a practical lens, and I came from a more philosophical lens.
Why should companies use security chaos engineering to test systems?
Rinehart: I see chaos engineering and its application to security as solving a new problem that’s manifesting post-deployment. The software engineering community realized systems are way too complex. Systems are unpredictable, and we aren’t communicating about why something may happen. We have microservices, cloud computing, DevOps, continuous integration/continuous delivery, etc., and we’re delivering things so much faster than before and at scale.
Microservices, for example, are dependent upon each other. We often have siloed teams for each microservice, and each team will have its own understanding of the overall project. For any one app, you may have 10 different independent microservices. Each team is then making changes to each microservice for a system they think they understand but don’t fully. There will be multiple versions of each microservice since they’re dependent upon services that may be aligned to older versions. It may appear they’re running well, but it’s hard to mentally model the system as it is with all those microservices and their varying versions.
Security is similar in the dependent aspect. You must understand the systems, or you don’t know what needs to be secure.
Shortridge: By using security chaos engineering and conducting experiments, you get to understand your systems. You’re uncovering different behaviors that can happen when you put the system under stress. You build muscle memory around how to respond to failure and how to respond to incidents. Then, when the inevitable happens, you feel a lot more ready.
How does a company implement security chaos engineering from a cultural angle?
Shortridge: Adopting the security chaos engineering culture is about understanding there are learning opportunities. Humans and computers make mistakes; we need to tolerate that and learn from it. Changing cultures isn’t easy, but it’s something any company can implement. Make learning transparent, and reduce blame so people feel empowered to talk about problems that emerge.
I’m a huge proponent of thinking more pragmatically about threat modeling as a place to start with security teams. Instead of worrying about a fancy kernel exploit, bring in engineers working on the systems, and ask how they would exfiltrate data from the system if they were in a hurry. Explore all the possible paths within a system because attackers want the easiest ways in. With this honest look at the system, see what faults emerge. This helps promote a system mindset and collaboration between security and engineering.
I gave a talk at RSA on how to harness the scientific method, which is what chaos engineering is all about, and come up with these hypotheses. A lot of that is through using things like attack trees to see, for example, if there are any AWS S3 [Simple Storage Service] buckets or any cloud storage buckets with sensitive data. From there, make sure it’s not set to be public. Some of those basic things are an obvious place to start. You don’t have to get super fancy because part of it involves running security chaos engineering experiments to build muscle memory. Start out with small experiments you can tackle manually.
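Editor’s note: The storage bucket check Shortridge describes can be framed as a falsifiable hypothesis. The Python sketch below is an illustration only and does not come from the interviewees. The `Bucket` record and the inventory are made up, and no real cloud provider API is called; in practice, the inventory would come from whatever asset data a team already has.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bucket:
    """Hypothetical inventory record, not a real cloud provider object."""
    name: str
    holds_sensitive_data: bool
    publicly_readable: bool


def find_violations(buckets):
    """Return the buckets that falsify the hypothesis
    'no bucket holding sensitive data is publicly readable'."""
    return [b for b in buckets if b.holds_sensitive_data and b.publicly_readable]


# Made-up inventory for illustration only.
INVENTORY = [
    Bucket("marketing-assets", holds_sensitive_data=False, publicly_readable=True),
    Bucket("customer-exports", holds_sensitive_data=True, publicly_readable=True),
    Bucket("audit-logs", holds_sensitive_data=True, publicly_readable=False),
]

if __name__ == "__main__":
    for bucket in find_violations(INVENTORY):
        print(f"Hypothesis falsified: {bucket.name} is sensitive and public")
```

A check this small can be run by hand at first, which matches the advice to start with experiments a team can tackle manually.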
What about implementing security chaos engineering testing from a more practical mindset?
Rinehart: Start with game-day exercises. We did the same thing for engineering, and you can apply it to security use cases. Some engineering use cases apply to control validation. So, try to verify the technology. Gather everyone, and introduce problems into the system to see how it reacts. It’s best to use scripts or code so you know exactly what changes you introduced and their results and can easily undo them afterward.
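Editor’s note: As a rough illustration of a scripted, reversible experiment, the Python sketch below injects a change, records it, checks whether a control notices and then restores the original state. It is an assumption-laden example, not the interviewees’ tooling: the in-memory `allowed_ports` set stands in for a real firewall, and `control_detects_drift` is a hypothetical detective control.

```python
from contextlib import contextmanager

# Hypothetical in-memory stand-in for a firewall's allowed inbound ports.
allowed_ports = {22, 443}
experiment_log = []


@contextmanager
def injected_fault(ports, bad_port):
    """Open an unapproved port for the duration of the experiment,
    record the change, and always restore the original state."""
    snapshot = set(ports)
    ports.add(bad_port)
    experiment_log.append(f"injected: opened port {bad_port}")
    try:
        yield
    finally:
        ports.clear()
        ports.update(snapshot)
        experiment_log.append(f"rolled back: closed port {bad_port}")


def control_detects_drift(ports, approved):
    """Hypothetical detective control: flags any port outside the approved set."""
    return bool(ports - approved)


def run_experiment():
    """Inject the fault, observe the control, and log the outcome."""
    approved = {22, 443}
    with injected_fault(allowed_ports, 8080):
        detected = control_detects_drift(allowed_ports, approved)
    experiment_log.append(f"control detected the change: {detected}")
    return detected


if __name__ == "__main__":
    run_experiment()
    print("\n".join(experiment_log))
```

Wrapping the change in a context manager means the rollback runs even if the check raises an error, so the record of what was introduced and undone stays complete.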
Next, think about logic around security systems and potential failure points. Think about how security controls serve as preventive measures. We build logic under the assumption that, during a condition, it will function to provide system availability or keep it secure. But does that still happen? Or has the system changed so much that the logic never gets executed as intended?
Another aspect to consider is security observability, which is one of the most difficult problems in software security today. It’s never been solved effectively before. The problem is that, when an incident does occur, you probably don’t know which log data is good or bad, unless you proactively introduced the problem into the system by testing it. I bet 90% of people wouldn’t detect any issues. People don’t look at individual logs from individual endpoints; instead, they look at log volume. Would you notice 1,000 missing logs? Or 10 million? If you can’t detect when that happens, that’s a problem. The industry has tools that collect all logs in a day, and you’re sure you’ll predict the next incident. That assumption is fundamentally flawed because you assume you have all the correct data to begin with, and that the data makes sense. Security chaos engineering gives you the opportunity to introduce problems and see how the system reacts.
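Editor’s note: The missing-logs question can be turned into a small experiment. The Python sketch below is illustrative only and rests on stated assumptions: a made-up daily baseline of 500 million log events and a hypothetical volume check that alerts when the count falls more than 5% below that baseline. It simulates silently dropping events and reports whether such a check would notice.

```python
def volume_alert_fires(expected, observed, tolerance=0.05):
    """Hypothetical volume-based check: alert when the observed log count
    falls more than `tolerance` (a fraction) below the expected count."""
    shortfall = (expected - observed) / expected
    return shortfall > tolerance


def run_drop_experiment(expected, dropped, tolerance=0.05):
    """Simulate silently dropping `dropped` log events and report
    whether the volume check would have noticed."""
    observed = expected - dropped
    return volume_alert_fires(expected, observed, tolerance)


DAILY_BASELINE = 500_000_000  # made-up daily log count

if __name__ == "__main__":
    for dropped in (1_000, 10_000_000, 50_000_000):
        noticed = run_drop_experiment(DAILY_BASELINE, dropped)
        print(f"dropped {dropped:>11,} logs -> alert fired: {noticed}")
```

Under these assumed numbers, a loss of 10 million events is only 2% of the daily volume, so a check tuned to a 5% shortfall stays silent. The thresholds are hypothetical, but the experiment shows how a team could find out where its own detection stops working.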
Are companies worried that implementing security chaos engineering will add more work to someone’s plate?
Rinehart: I heard that early on, but I also initially couldn’t really explain what I was doing in terms of value. I started asking companies, ‘What informs you that any of those things you’re doing works? How do you know?’ I get the worry. Security does a lot of things for the business, but many of these things are misaligned about where value gets delivered. If we’re going to spend money on it, we should make sure the tools are effective at doing what they’re supposed to do — and not just once. We’re making changes constantly, and we need to make sure everything still works as intended.
Deception technology, for example, adds more work for a security team, especially if you’re not sure how effective your [current security strategy] is. Why add a bunch of honey tokens or fake infrastructure when you don’t really know how vulnerable your main security tools are? With security chaos engineering, you can be confident and have the data to back up how your current tools work. Then, deception technology gets more interesting because you’re confident in what you already have.
Shortridge: One hot take I have is that you can take any ops or software engineer, and they will have sufficient skills. There are things like cron and Jenkins that can run some of these things. It’s basic stuff.
So, despite it adding more work, security chaos engineering helps security teams trust their tools function as intended?
Shortridge: One of the core issues in the security industry is that a lot of security professionals don’t understand how software is designed, developed and delivered. They’re familiar with some of the tools needed but aren’t familiar with things like automation. With security champions and security chaos engineering models, you have a more decentralized security where you can bring in engineers and make them accountable for their work. There aren’t really any decent ways to measure how our security products contribute to better security outcomes. Meanwhile, [security products] can increase the budget and can certainly increase head count. There are plenty of tools that, because they generate so many alerts, you have to hire more people. It’s almost like every security tool you add increases complexity, so you have to hire more people, and then you try to buy tools to scale it. It’s like this kind of Kafkaesque nightmare.
What we don’t know is whether the products and complexity actually stop attackers. Should we be buying all these tools? Security chaos engineering is a great way to cut down on complexity because you can focus on what’s going to matter, what will make an impact and have a real outcome. I think security people are a little nihilistic or fatalistic at this point, as if it’s impossible to determine how useful various security products are. We want to show that, no, it is possible to figure out what’s effective for you and what isn’t.
You already released one book on security chaos engineering. Anything else forthcoming?
Rinehart: It’s still a relatively nascent field, which is why we’re currently writing a full book. It’ll be out in 2022 and will include more case studies from companies implementing security chaos engineering. We also want to go into code examples, with a plan to look through the brand-new Kubernetes tool designed to take advantage of security chaos engineering testing.