To achieve software resilience, you must design security in and treat it as part of software quality. In this conversation, author and security expert Kelly Shortridge shares with us exactly how to do that – covering topics such as the philosophy of software resilience, Evaluation and Experimentation, Self Serve Security, and what it means to be “Safe to Fail.”
Kelly Shortridge is a Senior Principal at Fastly and lead author of Security Chaos Engineering: Sustaining Resilience in Software and Systems (O’Reilly Media). Shortridge is best known for their work on resilience in complex systems, the application of behavioral economics to cybersecurity, and bringing security out of the dark ages. Shortridge is a frequent keynote speaker, advisor, and author and has been a successful enterprise product leader, entrepreneur (with an exit to Crowdstrike), and investment banker.
Software is going to fail sometimes; there’s no avoiding it. Make sure to join us for this episode to learn more about how to build resilient software so that those failures are not catastrophic to your organization!
Watch the video:
Show notes with links to jump ahead are below
Show Notes from Episode 23 – Kelly Shortridge on Security Chaos Engineering and Resilience
Timestamp links will open that part of the show in YouTube in a new window
- Introduction – Thinking like an Attacker
- 00:00 Kelly’s opening quote: “So actually, you, just doing your job, you can relate a lot. All you have to think about is that alternate perspective … how could I hold this system ransom if I was really mad at my employer? … You can get really close to the attacker mindset that way … They are going to take the easiest path to their goal … So as long as you think about the easiest ways, you’re going to eliminate a lot of low hanging fruit, and so that’s something that I hope empowers engineering leaders … maybe we can start thinking about this as part of our architecture and design thinking too?”
- After the opening quote, Arin and David talk about what is the most secure type of software. David talks about the importance of thinking about security at each step, and Arin jokes that the only way to build completely secure software is to never deploy it. David and Arin then talk about the philosophy of security and how that is a key theme in today’s episode.
- Arin introduces Kelly with their bio.
- Systems Oriented Security
- 05:45 Arin asks Kelly to define systems oriented security and explain why it’s important to work with security at the system level. Kelly notes that our audience is not necessarily cybersecurity people, which is fantastic because they actually wrote the book with that audience in mind; they think cybersecurity is too often veiled as a mystical, arcane art. Kelly sees security as a subset of software quality: you need both security and quality. Kelly continues, “So when we talk about systems oriented security, what we’re talking about is making sure that all of the interactions between things are resulting in a system that’s resilient to attack.” A lot of the book is trying to help any sort of leader or engineer navigate how to build mental models of systems, because ultimately when we think about security, it’s about how the system responds and adapts to any sort of failure, including things like attacks.
- The Philosophy of Software Resilience
- 08:01 David talks about the concept of Anti-Fragility and how Kelly’s description of software resiliency in the book is very innovative. Kelly responds that “I’m trying to drag cybersecurity out of the Dark Ages, which I think much of your audience likely agrees when they’ve had to interact with their security teams. It can feel very draconian, almost imperialistic with the thou shalt nots and imposing tools that the security team doesn’t have to use, but you do. It can feel a little unfair.” Resiliency is about responding gracefully to failures.
- Kelly notes that Engineering teams have to adopt more security responsibilities. Security problems need to be solved by design, which means those designing the software also need to understand security. Because “your cybersecurity team, most of the time doesn’t understand the system. You do. Right? They don’t know how to write software, but you do.”
- David and Kelly talk about epistemology and philosophy, and how you must have an objective understanding of reality in order to design for security and resilience.
- Evaluation and Experimentation
- 13:21 David brings up Chapter 2 of Kelly’s book and the mental models described in there. Kelly introduces the E&E approach for “Evaluation and Experimentation” that is covered in Chapter 2. The E&E approach is about baby stepping a system towards resilience. First you have to evaluate the design using architecture diagrams and by evaluating how data flows in the system. Ideally you develop a sense of how errors propagate backwards through the system. Decision trees can help here, as well as a behavioral economics and game theory concept called “belief prompting.” The idea is that, similar to a game of chess, you have to decide what an attacker might do next if you make a certain change in the system. This allows you to go through system failure analysis using thought experiments, without having to conduct physical experiments. A lot of value can be uncovered here before getting to the Experimentation phase.
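Attacker decision trees like the ones Kelly describes can be made very concrete. Here is a minimal sketch, assuming invented node names and effort scores (not from the book), of walking a tree to find the easiest path an attacker would take – the “belief prompting” exercise in code form:

```python
# Illustrative sketch: a tiny attacker decision tree for belief prompting.
# The actions and effort scores below are hypothetical examples.

from dataclasses import dataclass, field

@dataclass
class Node:
    action: str              # what the attacker would do at this step
    effort: int              # rough relative effort (lower = easier)
    children: list["Node"] = field(default_factory=list)

def easiest_path(node: Node) -> tuple[int, list[str]]:
    """Return the lowest-total-effort path to a leaf (the attacker's goal)."""
    if not node.children:
        return node.effort, [node.action]
    cost, path = min(easiest_path(c) for c in node.children)
    return node.effort + cost, [node.action] + path

tree = Node("reach exposed service", 1, [
    Node("phish an admin credential", 3, [Node("exfiltrate customer data", 2)]),
    Node("exploit unpatched dependency", 2, [Node("exfiltrate customer data", 2)]),
])

cost, path = easiest_path(tree)
print(cost, " -> ".join(path))
```

Since attackers take the easiest path to their goal, the branch with the lowest total effort is the one to defend first.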
- In the second phase, Experimentation, you can apply Chaos Experiments to observe how the system behaves under certain error conditions and adverse scenarios. Another term for Chaos Experiments is Resilience Stress Testing, which may be more easily accepted by management, since these experiments may be done in production systems. These experiments help you uncover the realities of how the system works versus what you expected. Kelly talks about an example where a team assumed a firewall would catch certain types of traffic, but it only actually caught them 60% of the time (a nod to Anchorman’s “60% of the time, it works every time”). Many of the tools an engineering team is already using for cloud configuration can also be used to perform resilience testing.
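The firewall anecdote shows how an experiment turns a belief into a measurement. As a hedged sketch (the blocking rule and traffic samples below are entirely invented), a naive rule can catch far less than the 100% you assumed:

```python
# Sketch: measuring how much of the "blocked" traffic a rule actually blocks.
# The rule and the replayed malicious samples are made up for illustration.

import re

BLOCK_RULE = re.compile(r"\.\./")  # hypothetical rule: block literal "../"

def simulate(requests: list[str]) -> float:
    """Fraction of malicious requests the rule actually catches."""
    blocked = sum(1 for r in requests if BLOCK_RULE.search(r))
    return blocked / len(requests)

# Samples a resilience experiment might replay; the URL-encoded variants
# slip past the naive rule -- the gap between belief (100%) and reality.
samples = [
    "/files?path=../../etc/passwd",
    "/files?path=..%2F..%2Fetc%2Fpasswd",   # encoded, not caught
    "/files?path=....//etc/passwd",
    "/files?path=%2e%2e%2fetc%2fpasswd",    # encoded, not caught
    "/files?path=../secret",
]

observed = simulate(samples)
assert observed < 1.0, f"belief violated: only {observed:.0%} blocked"
```

In this toy case the rule catches 3 of 5 samples, the same 60% gap the episode describes; the point is that only measurement, not the architecture diagram, reveals it.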
- Security by Design
- 19:35 David brings up Chapter 4 in Kelly’s book, and the role of security in the day to day life of an engineer. Kelly talks about a key lesson they wanted to share, that “The cybersecurity industry thrives on these bolt-on tools and policies … Actually, if we want to achieve, resilience against attacks, a lot of it is more how do we implement these design-based defenses? How do we reduce hazardous methods and materials, as they call them, by design?”
- Kelly continues: “Like I analogize C code to lead. It’s really useful, but it kind of poisons us over time. So we can either refactor into a memory safe language, or we can use things like sandboxing and isolation. There are a bunch of ways we can treat it, but I think the core message I really wanted to communicate to software engineering teams or platform engineering teams is a lot of the practices you already have adopted and the things that you do for software quality uphold things like reliability. Turns out if you just think about them slightly differently, they apply to security as well.”
- Kelly discusses the importance of integration testing, which is not just about quality but also a great area to perform security tests.
- Security Testing in Loosely Coupled Systems
- 22:48 Arin brings up a myth debunked in Kelly’s book: that security at the component level adds up to system level security. In other words, component level security testing is not enough. Given that, Arin asks “When you have a system architecture that’s loosely coupled, like a microservice or component oriented architecture, what is the best way to ensure system level security across those components?” Kelly notes that they are a fan of both Unit level and Integration testing, but “if you had to choose between unit tests and integration tests, I would vote integration tests basically every time” because of the extra security issues an integration test can uncover. A loosely coupled architecture can be helpful if it allows systems to fail independently of each other, but the behavior of the system overall also has to be tested to ensure that the failure of a given component is handled gracefully by the system.
- Kelly says that the most hardcore way to test for resiliency is in production, but that there’s nothing wrong with starting in a staging environment. This lets you test taking a system offline and seeing that message queues and brokers handle the absence of that sub system well. Staging environments also are a safe place to test by pushing the system to its limits, such as overloading the CPU. Just starting with a load test in a staging environment can be a big help.
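The staging experiment Kelly describes – taking a subsystem offline and checking that queues absorb its absence – can be sketched in a few lines. This is a toy model with invented names, not a real chaos tool:

```python
# Toy staging experiment: the consumer is "offline" for the first few
# messages; the queue should buffer work until it recovers, losing nothing.

import queue

def run_outage_experiment(messages: int, outage_until: int) -> list[str]:
    buffer: queue.Queue[str] = queue.Queue()
    processed: list[str] = []
    for i in range(messages):
        buffer.put(f"msg-{i}")           # producer keeps publishing regardless
        if i >= outage_until:            # consumer back online: drain backlog
            while not buffer.empty():
                processed.append(buffer.get())
    return processed

# Consumer is down for the first 3 messages; all 5 should still arrive.
processed = run_outage_experiment(messages=5, outage_until=3)
assert len(processed) == 5
```

The hypothesis ("the broker handles the absence of that subsystem well") becomes a concrete assertion you can fail loudly in staging before trusting it in production.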
- Self Serve Security
- 26:30 Arin asks about how best to integrate security experts into a Scrum team so the team doesn’t feel like they are an obstacle to work. Should a security expert be in each Scrum team or available across teams? Kelly says there are a number of models, but the one they have seen work is “Security Champions”, where the security expert acts as an advisor to teams. But if you want security to be treated like platform engineering as much as possible, then it needs to be self serve, without long review periods. Tools like Dependabot and having the ability to upgrade dependencies via a simple button click are invaluable.
- Many security teams don’t understand how software is delivered, and that’s part of the reason they don’t take into account things like UX and delivery. This is an area that Kelly often educates security teams on so they can work more efficiently with the development groups. Serverless architectures also help with this notion of self serve security because the services are immutable and don’t require as many security checks – the cloud provider is doing much of that work for you and there is less of an attack surface to be taken advantage of.
- Kelly talks about how hackers are much more like software engineers than people realize. If you look at major criminal hacking organizations, they are investing in things like automation and infrastructure, just like platform engineering teams do. They are like a tech startup, and if you think about things from their perspective you can imagine a lot of the steps they may take and fix a lot of the low hanging fruit in your security defenses. Hackers will attack the easiest defenses, and so you eliminate the vast majority of your risk just by fixing low hanging fruit.
- A message from our sponsor: WebRTC.ventures
- 32:34 Building custom WebRTC video applications is hard, but your go live doesn’t have to be stressful. WebRTC.ventures is here to help you build, integrate, assess, test, and manage your live video application for web or mobile! Show notes continue below the ad
- Bringing in Outside Help
- 33:23 Arin asks when an engineering team should bring in outside security help. Kelly responds that early in the design phase is a good time, so that the security experts can help teams think about attack surfaces they may not have considered, drawing on their industry experience. The security experts can help teams construct things like decision trees to evaluate attack paths. Building in security by design like that can eliminate 80% of the security issues, and a team should be very proud if all they’ve left attackers to exploit are zero-day vulnerabilities.
- The bigger point is to make sure that you minimize the impact of an issue. According to the Verizon Data Breach Investigations report, around 95% of attacks are financially motivated. Attackers are looking at how to get your data or your resources so they can make money, so if you can make it hard for them to make money then you reduce the vulnerability of your system. You can’t always control zero-day vulnerabilities, especially in third party software, but you can minimize the impact of those potential vulnerabilities on your system.
- Fail Safe vs Safe to Fail
- 36:00 Kelly explains the difference between “Fail Safe” and “Safe to Fail”, a concept from the book. Fail Safe implies that failures cannot happen because they have all been prevented, which is not realistic. Safe to Fail means that even when failure happens, it’s not a big deal. For instance, having backup keys to your house makes it safe to fail, i.e., safe to lose your keys. Designing any system that is valuable means we have to design complex systems, and complex systems will fail sometimes. We just need to ensure that failure doesn’t cause big problems. Focusing on being “Fail Safe” can stifle innovation because you end up putting too much bubble wrap on everything.
- Communicating Failures
- 38:50 Since some failures are inevitable, it’s very important to consider how engineering managers can best communicate those failures, especially in a stressful situation. Kelly talks about the importance of highlighting the business case: resilience is not just about recovering, it’s about adapting to evolving conditions. Those changing conditions can be bad, like attacks, but they can also be good, like opportunities. Kelly talks about how many companies were late to move to the cloud because of cybersecurity concerns; that was a missed opportunity, and better software resilience would have given them the confidence to move sooner. The message engineering managers need to deliver is that they want to improve software resilience because “We want to help you succeed, and we want to help you better adapt as market conditions evolve.”
- Kelly continues: “I think it can be tricky when you’re telling an executive that things will fail. But I think the caveat is, wouldn’t it be better if we recover within seconds rather than recovering within hours? And that’s normally the trade off. It’s not a trade off between no failure and then failure happens, right? Failure is still going to happen. It’s just going to be way more onerous to clean up. It’s going to be more expensive for the business, it’s going to be harder to change things afterwards.”
- Effort Investment Portfolio
- 41:51 David asks Kelly to unpack the concept of an Effort Investment Portfolio. Kelly explains that they have a finance background, and so often think in financial terms: “The effort investment portfolio is basically the idea that our effort is finite, which a lot of people forget. We can’t invest in everything … You always want to make sure whenever you’re planning your strategy for the year, for the quarter, it’s like, okay, what do we expect in terms of return from this effort?”
- Kelly talks about adopting immutable infrastructure or refactoring to a memory safe language. “Refactoring involves an enormous amount of effort, but you also have to think about where that effort is happening. Maybe it’s easier for your product engineering teams, but maybe your SRE team is getting absolutely crushed from an incident perspective … Maybe you’re getting a ton of compliance fines because of all the incidents you’re having. Whatever it is, it’s really thinking about effort across, again, the whole socio-technical system and where we can invest a little more effort to save effort down the line.”
- Kelly explains that you cannot invest in every idea in the book; it’s too much. So you have to look at your own context, your skill sets, and where your efforts are best applied.
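The portfolio framing lends itself to a back-of-the-envelope calculation. Here is a minimal sketch, with entirely invented initiatives and numbers, of ranking candidate investments by effort saved per unit of effort spent:

```python
# Illustrative "effort investment portfolio": all figures are made up.
# name: (effort_to_adopt, effort_saved_per_quarter), in arbitrary units.

candidates = {
    "refactor to memory-safe language": (100, 15),
    "immutable infrastructure":         (40, 12),
    "dependabot-style auto-upgrades":   (10, 6),
}

def roi(item: tuple[str, tuple[int, int]]) -> float:
    """Effort saved per quarter, per unit of effort invested."""
    _name, (cost, saved) = item
    return saved / cost

portfolio = sorted(candidates.items(), key=roi, reverse=True)
for name, (cost, saved) in portfolio:
    print(f"{name}: pays back in ~{cost / saved:.1f} quarters")
```

The point of the exercise isn’t the exact numbers; it’s forcing the question Kelly raises – what return do we expect from this effort, and whose effort, across the whole socio-technical system, is actually being spent?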
- The Ultimate Human Error: Blaming Humans
- 44:50 David brings up the book’s point that investing in things like logging and observability reduces the cognitive load needed during incident response, an example of a good effort investment. Kelly states that “It’s a really important point that when things go wrong, if you’re blaming the human, that is the ultimate human error.” There are things you can change in the system that help the humans handle the incident better. Most of us want to do the right thing, but the complexity of the system can be overwhelming, so we need to invest in managing that complexity. The emotional turmoil of an incident, on the part of the customer or a system engineer, is an indicator that something in the system needs to change to make it more safe to fail.
- Security Chaos Engineering
- 48:56 David asks Kelly to talk about the steps of performing a chaos engineering experiment. Kelly stresses it’s important to start with evaluation using a decision tree, and come up with your hypotheses. Choose one of those hypotheses to conduct an experiment around, and keep it small. Similar to a software deployment, you need to then document how the experiment will be conducted and keep everyone in the loop, like a release plan. Then deploy it and make sure you have all your data collection running, so that you can analyze the data after the experiment completes. Finally, write up your findings and share the lessons with the team. The hardest parts of this tend to be the social and team aspects, not the technical aspects of running the experiment. Make sure that you continuously improve on your experiments so the organization is learning over time!
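The experiment loop Kelly walks through can be sketched as a minimal runner. This is a hedged illustration with placeholder names and a stand-in failure injection, not any real chaos engineering tool’s API:

```python
# Minimal experiment runner matching the steps above: pick one hypothesis,
# keep the blast radius small, inject the failure, observe, and write up.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Experiment:
    hypothesis: str                 # e.g. drawn from a decision-tree evaluation
    inject: Callable[[], None]      # small, scoped failure injection
    observe: Callable[[], bool]     # did the system behave as hypothesized?

def run(exp: Experiment) -> dict:
    """Deploy the experiment, collect the observation, return the write-up."""
    exp.inject()
    held = exp.observe()
    return {
        "hypothesis": exp.hypothesis,
        "held": held,
        "lesson": "belief confirmed" if held else "belief violated; fix and re-run",
    }

# Placeholder example: we believe retries absorb a single dropped request.
state = {"dropped": 0}
exp = Experiment(
    hypothesis="one dropped request is absorbed by retries",
    inject=lambda: state.update(dropped=1),
    observe=lambda: state["dropped"] <= 2,   # stand-in for a real health check
)
print(run(exp))
```

As the episode notes, the documentation, communication, and write-up steps around this loop are usually the hard part; the runner itself stays small on purpose.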
- 51:52 Kelly Shortridge’s book is “Security Chaos Engineering” from O’Reilly, and there is a link in the show notes below to buy a copy. Kelly is on most social media platforms, and they can also be reached at firstname.lastname@example.org if you want to engage Kelly to work with your company. There’s also a chat at Shortridge.io if you want to get philosophical, and Kelly is happy to meet people at conferences or sign books.