AWS re:Invent 2025 - Capital One: Building resilient systems by engineering for five nines (IND3308)

Building Resilient Systems at Scale: Capital One's Approach

Introduction

  • Capital One is regarded as a world-class organization in resilience practices across AWS
  • The presentation covers Capital One's journey as a platform company, the principles they believe in, and their focus on reliability, architectural patterns, and best practices

From Monoliths to Platforms

  • Prior to 2019, Capital One had duplicated capabilities across business lines, leading to non-standardized architectures and data duplication
  • In 2019, they declared themselves a platform organization, investing heavily in building scalable, reliable, and trustworthy foundational platforms

The Seven Pillars of a Modern Platform

  • The seven pillars that Capital One's modern platforms stand on are: reliability, availability, scalability, security, observability, agility, and governance
  • This presentation focuses on the first pillar: reliability

Reliability as the Engine of Trust

  • Reliability is measured by availability (uptime) and by success rate (the share of requests that complete without error)
  • Resilience plays a critical role in handling failures and ensuring platforms can recover from unexpected chaos
  • Reliability ensures the platform's capabilities function as intended over time, building trust with users
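
To make the two measures concrete, the arithmetic behind an availability target can be sketched as follows. This is an illustration only: the function names are hypothetical, and the year length is assumed to be a non-leap 365 days.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: non-leap year, for simplicity


def allowed_downtime_seconds(nines: int, period_seconds: int = SECONDS_PER_YEAR) -> float:
    """Downtime budget for a target of `nines` nines (5 means 99.999%)."""
    unavailability = 10.0 ** -nines
    return unavailability * period_seconds


def success_rate(total_requests: int, failed_requests: int) -> float:
    """Fraction of requests that succeeded; treated as 1.0 when there was no traffic."""
    if total_requests == 0:
        return 1.0
    return (total_requests - failed_requests) / total_requests


if __name__ == "__main__":
    for n in (3, 4, 5):
        minutes = allowed_downtime_seconds(n) / 60
        print(f"{n} nines: {minutes:.2f} minutes of downtime per year")
    print(f"success rate: {success_rate(1_000_000, 10):.5%}")
```

Under these assumptions, five nines leaves roughly 5.26 minutes of downtime per year, which is why the rest of the talk focuses on limiting the impact of failures rather than only reacting to them.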

Architectural Evolution Towards High Availability and Resilience

  • Architectural anti-patterns in the pre-platform era:
    • All-in-one monolithic approach with single points of failure
    • Proliferation of tightly coupled microservices
    • Shared clusters for web/mobile and batch workloads
    • Single database for analytical and real-time loads
  • Architectural best practices:
    • Deploy to multiple AWS regions and availability zones
    • Use domain-driven design with bounded context services
    • Ensure regional affinity and no cross-region dependencies
    • Choose auto-replicating databases
    • Define strict RTO (Recovery Time Objective) and RPO (Recovery Point Objective)
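
As a rough illustration of what enforcing strict RTO and RPO can look like, the sketch below compares the results of a failover drill against stated objectives. The class names, fields, and numbers are hypothetical and are not taken from the talk.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RecoveryObjectives:
    rto_seconds: float  # maximum tolerated time to restore service
    rpo_seconds: float  # maximum tolerated window of lost data


@dataclass(frozen=True)
class FailoverDrill:
    time_to_restore_seconds: float  # measured from failure to healthy traffic
    replication_lag_seconds: float  # data not yet replicated when the failure hit


def violations(objectives: RecoveryObjectives, drill: FailoverDrill) -> List[str]:
    """Return a list of objectives that the drill failed to meet."""
    problems = []
    if drill.time_to_restore_seconds > objectives.rto_seconds:
        problems.append("RTO exceeded")
    if drill.replication_lag_seconds > objectives.rpo_seconds:
        problems.append("RPO exceeded")
    return problems


if __name__ == "__main__":
    # Hypothetical targets: restore within 60 s, lose at most 5 s of data.
    targets = RecoveryObjectives(rto_seconds=60, rpo_seconds=5)
    drill = FailoverDrill(time_to_restore_seconds=45, replication_lag_seconds=8)
    print(violations(targets, drill))
```

A check like this is only useful if the drill measurements come from real failover exercises, not estimates.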

Sharding for Finite Reliability

  • Poison pill requests can cause cascading failures across regions
  • Circuit breakers and rate limiters are reactive: they respond only after failures have started, which on its own is not enough for five-nines reliability
  • Capital One developed a sharding technique using consistent hashing and shuffling to limit the blast radius of failures
  • This reduces the blast radius from 100% to 0.7%, achieving the required level of reliability
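
The idea of limiting blast radius by giving each customer a small, shuffled subset of workers can be sketched as below. This is a generic shuffle-sharding illustration, not Capital One's implementation: the hash-seeded selection, the function names, and the fleet and shard sizes are all assumptions.

```python
import hashlib
import random
from math import comb


def shuffle_shard(customer_id: str, workers: list, shard_size: int) -> frozenset:
    """Deterministically pick `shard_size` workers for a customer.

    The customer ID is hashed to seed a private random generator, so the
    same customer always lands on the same subset of workers.
    """
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return frozenset(rng.sample(workers, shard_size))


def full_overlap_fraction(worker_count: int, shard_size: int) -> float:
    """Chance that a random customer shares the exact same shard as another."""
    return 1 / comb(worker_count, shard_size)


def fully_affected(poisoned_id: str, customer_ids: list, workers: list, shard_size: int) -> float:
    """Fraction of other customers whose whole shard is taken out by one poisoned customer."""
    bad = shuffle_shard(poisoned_id, workers, shard_size)
    others = [c for c in customer_ids if c != poisoned_id]
    hit = sum(1 for c in others if shuffle_shard(c, workers, shard_size) <= bad)
    return hit / len(others)


if __name__ == "__main__":
    workers = [f"worker-{i}" for i in range(8)]       # hypothetical fleet
    customers = [f"customer-{i}" for i in range(2000)]
    print(f"theoretical: {full_overlap_fraction(len(workers), 2):.2%}")
    print(f"simulated:   {fully_affected('customer-0', customers, workers, 2):.2%}")
```

With 8 workers and shards of 2 there are 28 possible shards, so about 3.6% of customers would share a poisoned customer's exact shard; larger fleets or larger shards push that fraction lower. The 0.7% figure quoted in the talk depends on Capital One's own fleet and shard sizes, which this summary does not state.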

Serverless Adoption for Reliability

  • Challenges with EC2-based containers:
    • Scaling issues due to instance type availability and IP address exhaustion
    • Operational overhead of managing a fleet of EC2 instances
  • Serverless compute (ECS on Fargate) provides automatic scaling, removes the overhead of managing an instance fleet, and improves reliability
  • AWS Lambda is used carefully, with concurrency limits set so that a single function cannot exhaust the concurrency shared across the account
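
Lambda functions in a region draw from a shared account-level concurrency pool, so per-function limits need to be budgeted against that pool. The sketch below is a plain arithmetic check, not an AWS API call; the function name, the account limit of 1000, and the unreserved floor of 100 are illustrative assumptions that should be checked against the actual account quotas.

```python
from typing import Dict, List


def check_reservations(account_limit: int, reserved: Dict[str, int], min_unreserved: int) -> List[str]:
    """Return problems found in a plan of per-function reserved concurrency."""
    problems = []
    for name, value in reserved.items():
        if value < 0:
            problems.append(f"{name}: reserved concurrency cannot be negative")
    unreserved = account_limit - sum(reserved.values())
    if unreserved < min_unreserved:
        problems.append(
            f"only {unreserved} unreserved concurrency left, need at least {min_unreserved}"
        )
    return problems


if __name__ == "__main__":
    # Hypothetical functions and limits, for illustration only.
    plan = {"payments-handler": 600, "statement-batch": 350}
    for problem in check_reservations(account_limit=1000, reserved=plan, min_unreserved=100):
        print(problem)
```

Running a check like this before deployment makes it harder for one function's limit to starve every other function in the account.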

Failure Modes and Resilience Testing

  • Potential failure modes: cloud provider failures, internal platform dependencies, external vendor failures, poison pill requests, engineer-introduced bugs, and untrusted code
  • Capital One implemented a sandbox safety model to isolate untrusted code execution from critical platform services
  • Use AWS CDK (Cloud Development Kit) for infrastructure-as-code, enabling automated testing and drift detection
  • Leverage AWS CodeDeploy for zero-downtime deployments with lifecycle hooks for pre-validation and warm-up
  • Implement resilience testing using AWS FIS (Fault Injection Service) to simulate failure scenarios and validate recovery capabilities
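
The principle behind fault injection (deliberately break a dependency, then verify that the recovery path works) can be shown in a few lines. This sketch does not use AWS FIS; it is an in-process stand-in, and every name in it is hypothetical.

```python
import random


class InjectedFault(RuntimeError):
    """Raised in place of a real dependency failure during an experiment."""


def with_fault_injection(func, failure_rate: float, rng: random.Random):
    """Wrap `func` so that a fraction of calls fail with InjectedFault."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault(f"injected failure in {func.__name__}")
        return func(*args, **kwargs)
    return wrapper


def call_with_failover(primary, secondary, request):
    """Try the primary handler and fall back to the secondary on any error."""
    try:
        return primary(request)
    except Exception:
        return secondary(request)


def primary_region(request):
    return f"primary handled {request}"


def secondary_region(request):
    return f"secondary handled {request}"


if __name__ == "__main__":
    rng = random.Random(7)  # seeded so the experiment is repeatable
    flaky_primary = with_fault_injection(primary_region, failure_rate=1.0, rng=rng)
    # With a 100% injected failure rate, every request should fail over.
    print(call_with_failover(flaky_primary, secondary_region, "req-1"))
```

A managed service such as FIS applies the same idea at the infrastructure level, where the injected faults are real resource disruptions instead of raised exceptions.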

Observability and Reliability Standards

  • Standardized logging, metrics, tracing, and error code conventions
  • Moved from static to dynamic alerts based on traffic patterns and error budgets
  • Implemented bulkhead patterns to isolate auxiliary services from core platform services
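
One common way to make alerts dynamic is to alert on how fast the error budget is being consumed instead of on a fixed error count. The sketch below assumes that approach; the function names, thresholds, and minimum-traffic guard are hypothetical and not taken from the talk.

```python
def error_budget(slo: float) -> float:
    """Allowed failure fraction for a success-rate target, e.g. 0.999 leaves 0.001."""
    return 1.0 - slo


def burn_rate(failed: int, total: int, slo: float) -> float:
    """How many times faster than allowed the budget is being spent (1.0 is on budget)."""
    if total == 0:
        return 0.0
    return (failed / total) / error_budget(slo)


def should_alert(failed: int, total: int, slo: float,
                 burn_threshold: float, min_requests: int) -> bool:
    """Alert only when traffic is high enough to be meaningful and the burn rate is high."""
    if total < min_requests:
        return False
    return burn_rate(failed, total, slo) >= burn_threshold


if __name__ == "__main__":
    # Same window length, different traffic levels (all numbers are illustrative).
    print(should_alert(failed=5, total=100, slo=0.999, burn_threshold=10, min_requests=1000))
    print(should_alert(failed=200, total=10_000, slo=0.999, burn_threshold=10, min_requests=1000))
```

Because the decision depends on the failure ratio and on the traffic volume in the window, the same rule adapts to quiet and busy periods without hand-tuned static thresholds.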

Building Trust and Reliability

  • Reliability, availability, and resilience are critical for customer trust, competitive edge, compliance, and company reputation
  • Every code commit carries responsibility for the company's reputation and customer trust
