AWS re:Invent 2025 - Capital One: From Chaos Testing to Continuous Verification (SPS328)

Summary of "AWS re:Invent 2025 - Capital One: From Chaos Testing to Continuous Verification"

Why Chaos Engineering Matters

  • Organizations running complex systems cannot simply hope everything works; they need to know their systems are resilient.
  • Incorporating chaos testing early in development ensures resilience is built-in, not an afterthought.
  • Simulating network and system failures (e.g. latency, timeouts, packet loss) helps identify single points of failure and uncover unexpected behavior.
  • Proactively anticipating production-level failures (e.g. loss of VMs, AZs, regions) ensures applications can degrade gracefully and recover quickly.
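The idea of simulating dependency failures and checking for graceful degradation can be sketched in a few lines. This is a minimal illustration, not code from the talk: `FaultInjector`, `fetch_balance`, and `fetch_with_fallback` are hypothetical names, and the "packet loss" is modelled simply as a randomly raised `ConnectionError`.

```python
import random
import time


class FaultInjector:
    """Wraps a call and injects latency or failures at configured rates (hypothetical helper)."""

    def __init__(self, latency_s=0.0, failure_rate=0.0, seed=None):
        self.latency_s = latency_s        # extra delay added before every call
        self.failure_rate = failure_rate  # probability in [0, 1] that a call fails
        self._rng = random.Random(seed)

    def call(self, func, *args, **kwargs):
        if self.latency_s > 0:
            time.sleep(self.latency_s)    # simulated network latency
        if self._rng.random() < self.failure_rate:
            raise ConnectionError("injected fault: simulated packet loss")
        return func(*args, **kwargs)


def fetch_balance(account_id):
    """Stand-in for a real downstream dependency (assumption for illustration)."""
    return {"account_id": account_id, "balance": 100, "degraded": False}


def fetch_with_fallback(injector, account_id, retries=2):
    """Retry on injected faults, then degrade to a placeholder answer instead of crashing."""
    for _ in range(retries + 1):
        try:
            return injector.call(fetch_balance, account_id)
        except ConnectionError:
            continue
    return {"account_id": account_id, "balance": None, "degraded": True}
```

With `failure_rate=1.0` every attempt fails and the caller receives the degraded placeholder; with `failure_rate=0.0` the call passes through unchanged. Running the same caller under both settings is a small-scale version of checking that an application degrades gracefully.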

Capital One's Transformation Journey

  • In a complex, distributed system, entropy is constant: code changes, configurations drift, and dependencies update.
  • Point-in-time chaos testing creates a "resilience gap" as new vulnerabilities are introduced between tests.
  • Capital One shifted to a model of "continuous verification" to match the speed of testing to the speed of deployment.
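One rough way to make the "resilience gap" concrete is to count the deployments that have shipped since the last chaos test. The sketch below is illustrative only; the function name and the dates are invented for the example and do not come from the talk.

```python
from datetime import date


def unverified_deploys(deploy_dates, test_dates):
    """Count deployments made after the most recent chaos test (hypothetical metric)."""
    last_test = max(test_dates)
    return sum(1 for d in deploy_dates if d > last_test)


# Made-up dates: one test at the end of January, three deployments around it.
deploys = [date(2025, 1, 5), date(2025, 2, 10), date(2025, 3, 1)]
tests = [date(2025, 1, 31)]
gap = unverified_deploys(deploys, tests)
```

Under a point-in-time model this number grows with every release until the next test; under continuous verification it stays close to zero because a test follows each change.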

Automated Reliability Verification Framework

  1. Capability Enablement:

    • Provided a controlled, regulated, and audited platform for engineers to run chaos tests.
    • Used tools such as AWS Fault Injection Service (FIS, formerly Fault Injection Simulator) to test across compute, database, network, and other layers.
    • Built custom chaos capabilities beyond what off-the-shelf tools provided.
  2. Emergency Stop and Rollback:

    • Used telemetry to detect quickly when a chaos test is going awry and needs to be stopped.
    • Added the ability to roll the system back to its previous state after a chaos test.
  3. Service Level Measures:

    • Established clear service level objectives (SLOs) to define the acceptable performance and availability bounds.
    • Used chaos testing to learn what the appropriate SLOs should be for their applications.
  4. Continuous Verification:

    • Moved from periodic "game days" to running chaos tests continuously, matching the pace of system changes.
    • Verified that the system behaves as expected, even as it evolves, to identify and address new vulnerabilities.
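The four parts of the framework fit together as a loop: inject a fault, watch telemetry against the SLO, stop early on a breach, and always roll back. The following is a minimal sketch under stated assumptions, not Capital One's implementation: `Slo`, `Experiment`, the threshold values, and the `(error_rate, p99_latency_ms)` sample format are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Slo:
    """Acceptable bounds for an experiment (illustrative thresholds)."""
    max_error_rate: float       # e.g. 0.05 means at most 5% failed requests
    max_p99_latency_ms: float

    def is_met(self, error_rate, p99_latency_ms):
        return (error_rate <= self.max_error_rate
                and p99_latency_ms <= self.max_p99_latency_ms)


@dataclass
class Experiment:
    name: str
    slo: Slo
    log: list = field(default_factory=list)  # simple audit trail of actions taken

    def run(self, inject, rollback, samples):
        """Inject a fault, check telemetry samples, and stop at the first SLO breach.

        `samples` is an iterable of (error_rate, p99_latency_ms) readings taken
        while the fault is active. Returns True if the SLO held for every sample.
        """
        inject()
        self.log.append("inject")
        passed = True
        try:
            for error_rate, p99 in samples:
                if not self.slo.is_met(error_rate, p99):
                    self.log.append("emergency_stop")  # halt before more damage
                    passed = False
                    break
        finally:
            rollback()                                 # always restore the system
            self.log.append("rollback")
        return passed
```

Running such an experiment from the deployment pipeline on every release, rather than on a scheduled game day, is one way to approximate continuous verification. The `finally` block is the key design choice: the rollback runs whether the test passes, breaches the SLO, or raises an unexpected error.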

Scaling Chaos Engineering and Measuring Outcomes

  • Provided a safe, controlled environment for engineers to build confidence in their ability to respond to incidents.
  • Enabled faster recovery from failures by familiarizing teams with their system's behavior and failure modes.
  • Improved overall system reliability and resilience by proactively addressing issues before they impact customers.
  • Established a culture of "safety first" with robust monitoring, logging, and audit trails to ensure responsible chaos testing.

Key Takeaways

  • Chaos engineering should be a core part of the development process, not an afterthought.
  • Continuous verification is essential to keep pace with the constant changes in complex, distributed systems.
  • Automation and tooling are critical to enable safe, controlled, and scalable chaos testing across the organization.
  • Establishing clear service level measures and objectives helps guide chaos testing and measure its impact.
  • A culture of safety, confidence, and continuous improvement is key to successfully implementing chaos engineering.
