Chaos engineering: A proactive approach to system resilience (ARC326)


Introduction

  • The session is presented by Adrian Cockcroft, Principal Engineer at Resilience, and Yana, an engineer on their team.
  • They are joined by Lex from the BMW Group, who will share their journey with Chaos Engineering.
  • The presentation will cover what Chaos Engineering is, why it's important, and practical examples of how to get started.

What is Chaos Engineering?

  • Chaos Engineering is a discipline of experimentation, where faults are intentionally injected into a system to gain confidence in its ability to recover from turbulent conditions.
  • It's a way to verify our assumptions about the system, as humans have many biases that can affect the design and operation of systems.
  • Chaos Engineering helps uncover these hidden assumptions and biases by observing how the system reacts to failures.
  • It's a continuous effort, not a one-time project, and should be seen as a long-term strategy, similar to cybersecurity.

The Cost of Outages

  • Outages can be extremely costly for enterprises, with the average cost being around $300,000 per outage.
  • Companies that practice Chaos Engineering recover from outages 16 times faster than those that don't.
  • Chaos Engineering helps reduce the time spent on the human-driven processes of detecting, evaluating, responding, and recovering from outages.

Chaos Engineering at Amazon

  • Amazon has been practicing Chaos Engineering since the early 2000s, using a program called "Game Days" to train their operations team.
  • Game Days are mandatory for every future launch of Amazon's free services, as part of the Operational Readiness Review process.

Regulations and Chaos Engineering

  • Regulations, such as the Digital Operational Resilience Act (DORA) in Europe, are now mandating that companies in the financial services industry perform Chaos Engineering and resilience exercises.
  • This is driving a significant increase in Chaos Engineering adoption, especially in regulated industries like finance and automotive.

Chaos Engineering Maturity

  • Chaos Engineering maturity typically follows a path of initial adoption, learning, failure mode analysis, ad-hoc experimentation, game days, and eventually continuous experimentation.
  • Failure mode analysis is a critical step, where the team identifies potential failure points in the system, such as fault isolation boundaries, dependencies, and bimodal behavior.

Hands-on Chaos Engineering Demo

Yana demonstrates Chaos Engineering in action, using AWS Fault Injection Service (FIS):

Ad-hoc Experimentation

  • Defines a steady state for the application, and identifies assumptions to verify.
  • Injects network latency into the ElastiCache dependency, expecting the application to fail over to DynamoDB.
  • Discovers a timeout issue in the application code and fixes it, verifying the fix through a repeat experiment.
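
The kind of timeout issue described above can be illustrated with a minimal sketch. This is not the code from the demo: the in-memory dictionaries stand in for ElastiCache and DynamoDB, the sleep stands in for the injected network latency, and every name and threshold here (`read_with_fallback`, `steady_state_holds`, the 0.1 s timeout, the 0.3 s latency budget) is a hypothetical chosen for illustration. The idea is that a bounded timeout on the cache read lets the fallback path keep the steady state intact while latency is injected.

```python
import concurrent.futures
import time

# Hypothetical stand-ins for the cache (ElastiCache) and the fallback store (DynamoDB).
CACHE = {"user:1": "alice"}
FALLBACK_STORE = {"user:1": "alice"}

# Module-level pool so a slow cache read never blocks the caller past its timeout.
_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def cache_get(key, injected_latency_s=0.0):
    """Simulated cache read; the sleep models the injected network latency."""
    time.sleep(injected_latency_s)
    return CACHE.get(key)


def fallback_get(key):
    """Simulated read from the fallback store."""
    return FALLBACK_STORE.get(key)


def read_with_fallback(key, cache_timeout_s=0.1, injected_latency_s=0.0):
    """Return (value, source); fall back when the cache read exceeds the timeout."""
    future = _pool.submit(cache_get, key, injected_latency_s)
    try:
        return future.result(timeout=cache_timeout_s), "cache"
    except concurrent.futures.TimeoutError:
        return fallback_get(key), "fallback"


def steady_state_holds(n_requests=3, latency_budget_s=0.3, injected_latency_s=0.0):
    """Assumed steady state: every request returns a value within the latency budget."""
    for _ in range(n_requests):
        start = time.monotonic()
        value, _source = read_with_fallback(
            "user:1", injected_latency_s=injected_latency_s
        )
        if value is None or time.monotonic() - start > latency_budget_s:
            return False
    return True
```

Calling `steady_state_holds(injected_latency_s=0.5)` repeats the experiment with latency injected. Without the bounded timeout, each request would wait for the full cache delay and exceed the budget, which is the class of problem a repeat experiment is meant to confirm as fixed.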

Game Days

  • Describes the planning, execution, and analysis of a "game day" exercise, where an Availability Zone is intentionally impaired.
  • Observes the application's behavior, including alarms, customer impact, and mitigation steps.
  • Identifies a hardcoded dependency on the impaired Availability Zone, leading to a mitigation plan.
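
To make the hardcoded-dependency finding concrete, here is a small sketch under stated assumptions. The summary does not say what the dependency was, so the zone names, endpoint hostnames, and both functions below are hypothetical. It only contrasts a lookup pinned to one Availability Zone with one that skips zones marked as impaired.

```python
# Hypothetical endpoints, one per Availability Zone; names are illustrative only.
ENDPOINTS = {
    "az-a": "db.az-a.internal.example",
    "az-b": "db.az-b.internal.example",
    "az-c": "db.az-c.internal.example",
}

HARDCODED_AZ = "az-a"  # the kind of pinned dependency a game day can surface


def pick_endpoint_hardcoded(impaired):
    """Return None when the pinned zone is impaired: there is nowhere else to go."""
    if HARDCODED_AZ in impaired:
        return None
    return ENDPOINTS[HARDCODED_AZ]


def pick_endpoint_zone_aware(impaired, preferred="az-a"):
    """Prefer one zone, but fall through to any zone not marked impaired."""
    candidates = [preferred] + sorted(z for z in ENDPOINTS if z != preferred)
    for zone in candidates:
        if zone not in impaired:
            return ENDPOINTS[zone]
    return None
```

With `impaired={"az-a"}`, the hardcoded lookup has no usable endpoint, while the zone-aware lookup moves to the next zone in the candidate list.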

Continuous Experimentation

  • Demonstrates how to schedule the Chaos Engineering experiments to run regularly, using the FIS tool.
  • Mentions the plan to automate and empower service owners to run their own Chaos experiments.
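
A recurring run could be wired up roughly as follows. This is a sketch, not the setup shown in the session: it assumes an experiment template already exists, that some external scheduler invokes `handler` on a fixed cadence, and that the event carries a `template_id` field (an invented shape). `FakeFisClient` is a hypothetical in-memory stand-in so the example runs without AWS credentials. Its method mirrors the `start_experiment` call of the boto3 FIS client, which takes `clientToken` and `experimentTemplateId`.

```python
import uuid


def start_scheduled_experiment(fis_client, template_id):
    """Start one run of an existing experiment template and return (id, status)."""
    response = fis_client.start_experiment(
        clientToken=str(uuid.uuid4()),  # idempotency token for this run
        experimentTemplateId=template_id,
    )
    experiment = response["experiment"]
    return experiment["id"], experiment["state"]["status"]


class FakeFisClient:
    """Hypothetical in-memory stand-in; the returned id and status are made up."""

    def __init__(self):
        self.started = []

    def start_experiment(self, clientToken, experimentTemplateId):
        self.started.append(experimentTemplateId)
        return {
            "experiment": {
                "id": f"EXP-{len(self.started)}",
                "state": {"status": "initiating"},
            }
        }


def handler(event, context=None, fis_client=None):
    """Entry point a scheduler could invoke; the event shape is an assumption."""
    client = fis_client or FakeFisClient()
    exp_id, status = start_scheduled_experiment(client, event["template_id"])
    return {"experiment_id": exp_id, "status": status}
```

In a real account, passing `boto3.client("fis")` as `fis_client` would start the experiment for real, so the template's stop conditions should be in place before any schedule is enabled.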

Lessons Learned from BMW's Chaos Engineering Journey

Lex shares BMW's Chaos Engineering journey:

  • Cross-team collaboration and knowledge sharing were crucial for the success of their Chaos Engineering initiative.
  • Securing leadership buy-in was essential, as running Chaos experiments in production requires upper-level support.
  • Fostering a culture of psychological safety, where failure is seen as an opportunity to learn and improve, was the most important lesson.
  • BMW started small, building confidence with lower-risk experiments, before moving to more complex scenarios like a full Availability Zone outage.
  • The journey is ongoing, with plans to automate and scale Chaos Engineering across their 1,300+ microservices.
