AWS re:Invent 2025 - Capital One: Building resilient systems by engineering for five nines (IND3308)
Building Resilient Systems at Scale: Capital One's Approach
Introduction
Capital One is regarded as a world-class organization for its resilience practices on AWS
The presentation covers Capital One's journey as a platform company, the principles they believe in, and their focus on reliability, architectural patterns, and best practices
From Monoliths to Platforms
Prior to 2019, Capital One had duplicated capabilities across business lines, leading to non-standardized architectures and data duplication
In 2019, they declared themselves a platform organization, investing heavily in building scalable, reliable, and trustworthy foundational platforms
The Seven Pillars of a Modern Platform
The seven pillars that Capital One's modern platforms stand on are: reliability, availability, scalability, security, observability, agility, and governance
This presentation focuses on the first pillar: reliability
Reliability as the Engine of Trust
Reliability is measured by uptime (availability) and by success rate (the share of requests that complete without error, i.e. one minus the error rate)
Resilience plays a critical role in handling failures and ensuring platforms can recover from unexpected chaos
Reliability ensures the platform's capabilities function as intended over time, building trust with users
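To make the two measures above concrete, here is a minimal sketch that converts an availability target into a yearly downtime budget and derives a success rate from request counts. The function names and sample numbers are illustrative assumptions, not figures from the talk.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_budget_minutes(availability: float) -> float:
    """Minutes of downtime per year permitted by an availability target."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def success_rate(total_requests: int, failed_requests: int) -> float:
    """Fraction of requests that succeeded (one minus the error rate)."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return (total_requests - failed_requests) / total_requests


if __name__ == "__main__":
    targets = [("three nines", 0.999), ("four nines", 0.9999), ("five nines", 0.99999)]
    for label, target in targets:
        print(f"{label}: {downtime_budget_minutes(target):.2f} minutes/year")
    # Hypothetical traffic: 10 failures in one million requests.
    print(f"success rate: {success_rate(1_000_000, 10):.5f}")
```

Five nines (99.999%) leaves roughly 5.26 minutes of downtime per year, which is why the rest of the talk concentrates on limiting the impact of failures instead of only reacting to them.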
Architectural Evolution Towards High Availability and Resilience
Architectural anti-patterns in the pre-platform era:
All-in-one monolithic approach with single points of failure
Proliferation of tightly coupled microservices
Shared clusters for web/mobile and batch workloads
Single database for analytical and real-time loads
Architectural best practices:
Deploy to multiple AWS regions and availability zones
Use domain-driven design with bounded context services
Ensure regional affinity (a request is served entirely within one region) and avoid cross-region dependencies
Choose auto-replicating databases
Define strict RTO (Recovery Time Objective) and RPO (Recovery Point Objective)
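The case for multiple regions and against cross-region dependencies can be shown with simple availability arithmetic. The sketch below assumes failures are statistically independent, which real regions and services only approximate; the function names and the percentages are illustrative, not numbers from the talk.

```python
def parallel_availability(per_region: float, regions: int) -> float:
    """Availability when any one of several independent regions can serve traffic."""
    return 1.0 - (1.0 - per_region) ** regions


def serial_availability(*dependencies: float) -> float:
    """Availability of a request path that needs every dependency to be up."""
    result = 1.0
    for availability in dependencies:
        result *= availability
    return result


if __name__ == "__main__":
    # Two independent regions at 99.9% each: both must fail for an outage.
    print(f"two regions: {parallel_availability(0.999, 2):.6f}")
    # A 99.99% service that calls a 99.9% dependency in another region
    # can be no more available than the product of the two.
    print(f"with cross-region call: {serial_availability(0.9999, 0.999):.7f}")
```

Under these assumptions, adding a region multiplies the failure probabilities together, while adding a hard dependency multiplies the availabilities together, so every cross-region call pulls the overall figure down.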
Sharding for Finite Reliability
Poison pill requests can cause cascading failures across regions
Circuit breakers and rate limiters are reactive: they act only after failures have started, so on their own they are not enough for five-nines (99.999%) reliability
Capital One developed a sharding technique using consistent hashing and shuffling to limit the blast radius of failures
This reduces the blast radius from 100% to 0.7%, achieving the required level of reliability
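The idea behind this kind of sharding can be sketched in a few lines. This is not Capital One's implementation: it swaps the consistent-hashing scheme mentioned above for a simpler hash-seeded shuffle, and the worker count, shard size, and customer IDs are hypothetical. Each customer is deterministically assigned a small subset of workers, so a poison pill request can only take down the workers in its sender's own shard.

```python
import hashlib
import math
import random


def shard_for(customer_id: str, workers: int, shard_size: int) -> frozenset[int]:
    """Deterministically pick `shard_size` distinct workers for a customer."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:8], "big")
    return frozenset(random.Random(seed).sample(range(workers), shard_size))


def fully_impacted_fraction(customers: list[str], poisoned: str,
                            workers: int, shard_size: int) -> float:
    """Fraction of other customers whose every worker is taken down by `poisoned`."""
    dead = shard_for(poisoned, workers, shard_size)
    others = [c for c in customers if c != poisoned]
    hit = sum(1 for c in others if shard_for(c, workers, shard_size) <= dead)
    return hit / len(others)


def expected_blast_radius(workers: int, shard_size: int) -> float:
    """Chance that another customer draws the identical shard: 1 / C(n, k)."""
    return 1.0 / math.comb(workers, shard_size)


if __name__ == "__main__":
    ids = [f"customer-{i}" for i in range(10_000)]  # hypothetical customers
    measured = fully_impacted_fraction(ids, "customer-0", workers=8, shard_size=2)
    print(f"measured:  {measured:.2%}")
    print(f"expected:  {expected_blast_radius(8, 2):.2%}")
```

With 8 workers and shards of 2 there are 28 possible shards, so only about 1 in 28 customers shares the poisoned customer's exact shard; everyone else still has at least one healthy worker. More workers or larger shards increase the number of combinations and shrink the blast radius further.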
Serverless Adoption for Reliability
Challenges with EC2-based containers:
Scaling issues due to instance type availability and IP address exhaustion
Operational overhead of managing a fleet of EC2 instances
Serverless containers (Amazon ECS on AWS Fargate) provide automatic scaling, remove the overhead of managing EC2 instances, and improve reliability
AWS Lambda is used carefully, with concurrency limits on functions so that a single function cannot cause account-level issues such as exhausting shared concurrency
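One way to reason about Lambda concurrency limits is to check a reservation plan against the account quota before applying it. The sketch below is a local check only and makes no AWS calls. The function name and the example reservations are hypothetical, and the default values (a regional quota of 1,000 concurrent executions and 100 left unreserved) are assumptions that should be verified against the account's actual Lambda quotas.

```python
def check_concurrency_plan(reservations: dict[str, int],
                           account_limit: int = 1000,
                           min_unreserved: int = 100) -> int:
    """Return the unreserved concurrency left over, or raise if over-committed.

    `account_limit` and `min_unreserved` are assumed defaults; confirm them
    against the real quotas for the account and region.
    """
    for name, value in reservations.items():
        if value < 0:
            raise ValueError(f"{name}: reserved concurrency cannot be negative")
    unreserved = account_limit - sum(reservations.values())
    if unreserved < min_unreserved:
        raise ValueError(
            f"plan leaves {unreserved} unreserved, below the minimum of {min_unreserved}"
        )
    return unreserved


if __name__ == "__main__":
    # Hypothetical functions and limits.
    plan = {"payments-authorizer": 300, "statement-generator": 200}
    print("unreserved concurrency:", check_concurrency_plan(plan))
```

A plan that passes this check could then be applied per function, for example with the boto3 Lambda client's `put_function_concurrency` call, which sets a function's `ReservedConcurrentExecutions`.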