A detailed summary of the session transcript, broken into sections for readability:
## Introduction
- The session is presented by Adrian Cockcroft, a Principal Engineer on the Resilience team, and Yana, an engineer on the same team.
- They are joined by Lex from the BMW Group, who will share their journey with Chaos Engineering.
- The presentation will cover what Chaos Engineering is, why it's important, and practical examples of how to get started.
## What is Chaos Engineering?
- Chaos Engineering is a discipline of experimentation, where faults are intentionally injected into a system to gain confidence in its ability to recover from turbulent conditions.
- It's a way to verify our assumptions about the system, as humans have many biases that can affect the design and operation of systems.
- Chaos Engineering helps uncover these hidden assumptions and biases by observing how the system reacts to injected failures; a minimal sketch of the experiment loop follows this list.
- It's a continuous effort, not a one-time project, and should be seen as a long-term strategy, similar to cybersecurity.
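To make the method concrete, below is a minimal, hypothetical sketch of the loop a chaos experiment follows; the `check_steady_state`, `inject_fault`, and `remove_fault` functions are placeholders standing in for system-specific probes and fault injectors, not anything shown in the talk.

```python
import time

def check_steady_state() -> bool:
    """Probe the steady state, e.g. p99 latency or error rate vs. a threshold.
    Stubbed out here; a real probe would query your monitoring system."""
    return True

def inject_fault() -> None:
    """Start the fault under test, e.g. add latency to a dependency."""

def remove_fault() -> None:
    """Roll the fault back so the experiment always cleans up."""

def run_experiment(observation_seconds: int = 60) -> bool:
    """Test the hypothesis that the system stays within its steady state
    while the fault is active. Returns True if the hypothesis held."""
    assert check_steady_state(), "system must be healthy before injecting faults"
    inject_fault()
    try:
        deadline = time.time() + observation_seconds
        while time.time() < deadline:
            if not check_steady_state():
                return False  # hypothesis disproved: a weakness was found
            time.sleep(5)
        return True
    finally:
        remove_fault()  # always stop the fault, even if a check fails

if __name__ == "__main__":
    print("hypothesis held" if run_experiment(15) else "weakness found")
```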
## The Cost of Outages
- Outages can be extremely costly for enterprises, with an average cost of roughly $300,000 per outage.
- Companies that practice Chaos Engineering recover from outages 16 times faster than those that don't.
- Chaos Engineering helps reduce the time spent on the human-driven processes of detecting, evaluating, responding, and recovering from outages.
## Chaos Engineering at Amazon
- Amazon has been practicing Chaos Engineering since the early 2000s, using a program called "Game Days" to train their operations team.
- Game Days are mandatory for every new service launch at Amazon, as part of the Operational Readiness Review (ORR) process.
## Regulations and Chaos Engineering
- Regulations such as the Digital Operational Resilience Act (DORA) in Europe now mandate that companies in the financial services industry perform resilience testing, which includes Chaos Engineering-style experiments.
- This is driving a significant increase in Chaos Engineering adoption, especially in regulated industries like finance and automotive.
## Chaos Engineering Maturity
- Chaos Engineering maturity typically follows a path of initial adoption, learning, failure mode analysis, ad-hoc experimentation, game days, and eventually continuous experimentation.
- Failure mode analysis is a critical step, where the team identifies potential failure points in the system, such as fault isolation boundaries, dependencies, and bimodal behavior.
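For illustration, here is a hypothetical sketch of what such a failure mode analysis might record for a demo-style application; the components and assumed mitigations below are invented examples, not taken from the talk. Each assumed mitigation is exactly the kind of hypothesis a later chaos experiment verifies.

```python
# Hypothetical failure mode analysis for a demo-style application.
# Each entry pairs a suspected failure mode with the assumed mitigation;
# the mitigation becomes the hypothesis of a chaos experiment.
failure_modes = [
    {
        "component": "cache dependency (e.g. ElastiCache)",
        "failure": "network latency / unreachable endpoint",
        "assumed_mitigation": "client times out fast and fails over to the database",
    },
    {
        "component": "Availability Zone (fault isolation boundary)",
        "failure": "full AZ impairment",
        "assumed_mitigation": "traffic shifts to healthy AZs; no hardcoded AZ dependencies",
    },
    {
        "component": "downstream API",
        "failure": "throttling / elevated error rate",
        "assumed_mitigation": "retries with backoff; degrades gracefully, no bimodal behavior",
    },
]

for fm in failure_modes:
    print(f"{fm['component']}: on '{fm['failure']}', verify: {fm['assumed_mitigation']}")
```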
## Hands-on Chaos Engineering Demo
Yana demonstrates Chaos Engineering in action using the AWS Fault Injection Service (FIS):
### Ad-hoc Experimentation
- Defines a steady state for the application and identifies assumptions to verify.
- Injects network latency into the Amazon ElastiCache dependency, expecting the application to fail over to DynamoDB.
- Discovers a timeout issue in the application code and fixes it, verifying the fix through a repeat experiment.
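As a rough sketch of how such a latency experiment can be defined programmatically with FIS, the template below uses boto3 and the AWS managed `AWSFIS-Run-Network-Latency` SSM document; the role ARN, the `ChaosTarget` tag, and all parameter values are illustrative assumptions, not the template shown in the demo.

```python
import json
import boto3

fis = boto3.client("fis")

# Assumptions: the app runs on EC2 instances tagged ChaosTarget=true, and an
# IAM role trusted by FIS already exists (the ARN below is a placeholder).
template = fis.create_experiment_template(
    description="Inject network latency to verify failover from the cache to DynamoDB",
    roleArn="arn:aws:iam::123456789012:role/FisExperimentRole",  # placeholder
    targets={
        "app-instances": {
            "resourceType": "aws:ec2:instance",
            "resourceTags": {"ChaosTarget": "true"},
            "selectionMode": "ALL",
        }
    },
    actions={
        "inject-latency": {
            "actionId": "aws:ssm:send-command",  # runs an SSM document on the targets
            "targets": {"Instances": "app-instances"},
            "parameters": {
                "documentArn": "arn:aws:ssm:us-east-1::document/AWSFIS-Run-Network-Latency",
                "documentParameters": json.dumps({
                    "DurationSeconds": "300",
                    "DelayMilliseconds": "200",  # illustrative latency
                    "Interface": "eth0",
                    "InstallDependencies": "True",
                }),
                "duration": "PT5M",
            },
        }
    },
    stopConditions=[{"source": "none"}],  # in practice, use a CloudWatch alarm guardrail
)

experiment = fis.start_experiment(
    experimentTemplateId=template["experimentTemplate"]["id"]
)
print("experiment id:", experiment["experiment"]["id"])
```

The timeout issue an experiment like this surfaces is typically fixed client-side, for example by configuring a short socket timeout on the cache client so the call fails fast and the fallback path to DynamoDB actually gets exercised.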
### Game Days
- Describes the planning, execution, and analysis of a "game day" exercise, where an Availability Zone is intentionally impaired.
- Observes the application's behavior, including alarms, customer impact, and mitigation steps.
- Identifies a hardcoded dependency on the impaired Availability Zone, leading to a mitigation plan.
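Below is a hypothetical sketch of how an AZ impairment like this can be expressed as an FIS template, using the `aws:network:disrupt-connectivity` action against the subnets of one Availability Zone; the tag used to select subnets, the role and alarm ARNs, and the duration are assumptions for illustration.

```python
import boto3

fis = boto3.client("fis")

# Assumption: the subnets in the target AZ were tagged AzImpairmentTarget=true
# ahead of the game day; the ARNs below are placeholders.
template = fis.create_experiment_template(
    description="Game day: disrupt network connectivity in one Availability Zone",
    roleArn="arn:aws:iam::123456789012:role/FisExperimentRole",  # placeholder
    targets={
        "az-subnets": {
            "resourceType": "aws:ec2:subnet",
            "resourceTags": {"AzImpairmentTarget": "true"},
            "selectionMode": "ALL",
        }
    },
    actions={
        "impair-az": {
            "actionId": "aws:network:disrupt-connectivity",
            "targets": {"Subnets": "az-subnets"},
            "parameters": {"scope": "all", "duration": "PT10M"},
        }
    },
    # Guardrail: stop the experiment automatically if customer impact
    # exceeds the alarm threshold.
    stopConditions=[{
        "source": "aws:cloudwatch:alarm",
        "value": "arn:aws:cloudwatch:us-east-1:123456789012:alarm:CustomerImpact",  # placeholder
    }],
)
print("template id:", template["experimentTemplate"]["id"])
```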
### Continuous Experimentation
- Demonstrates how to schedule the Chaos Engineering experiments to run regularly, using the FIS tool.
- Mentions plans to automate experiments further and to empower service owners to run their own Chaos experiments.
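FIS runs one experiment per `StartExperiment` call, so continuous experimentation means triggering that call on a schedule. One common pattern, assumed here rather than confirmed by the talk, is an EventBridge schedule invoking a small Lambda that starts a pre-approved template; the template ID below is a placeholder.

```python
import boto3

fis = boto3.client("fis")

TEMPLATE_ID = "EXT123456789012"  # placeholder FIS experiment template id

def handler(event, context):
    """Lambda handler, assumed to be triggered by an EventBridge schedule
    (e.g. rate(7 days)), that starts the pre-approved FIS experiment."""
    experiment = fis.start_experiment(
        experimentTemplateId=TEMPLATE_ID,
        tags={"trigger": "scheduled"},  # makes scheduled runs easy to audit
    )
    return {"experimentId": experiment["experiment"]["id"]}
```

Because the template and its stop conditions are reviewed up front, service owners can be granted permission to start only their own templates, which supports the plan of letting teams run their own experiments.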
## Lessons Learned from BMW's Chaos Engineering Journey
Lex shares BMW's Chaos Engineering journey:
- Cross-team collaboration and knowledge sharing were crucial for the success of their Chaos Engineering initiative.
- Securing leadership buy-in was essential, as running Chaos experiments in production requires upper-level support.
- Fostering a culture of psychological safety, where failure is seen as an opportunity to learn and improve, was the most important lesson.
- BMW started small, building confidence with lower-risk experiments, before moving to more complex scenarios like a full Availability Zone outage.
- The journey is ongoing, with plans to automate and scale Chaos Engineering across their 1,300+ microservices.