Summary of the video transcription, organized by section:
Guiding Principles
- "Everything fails, all the time" (quote from Werner Vogels, Amazon CTO)
- Failure cannot be legislated against; focus on fast detection and response (quote from an EC2 founder)
- Resilience is not just about design and architecture; it is also about how the application and the team respond to failures
Failure Scenarios
- Hard failures (e.g., an Availability Zone (AZ) outage) are easy to detect and recover from using managed services
- Gray failures (e.g., application-level issues) are harder to detect and recover from, and they require better observability
Key Focus Areas for Resilience
- Fault Isolation:
  - Physical boundaries (availability zones, regions, global services)
  - Logical boundaries (microservices, AWS accounts, "cellular" architecture)
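One way to picture the logical boundary of a cellular architecture is to pin each customer to a single cell with a stable hash, so that a failing cell affects only the customers mapped to it. The following is a minimal sketch under stated assumptions: the cell names and the `cell_for` and `blast_radius` helpers are hypothetical and do not come from the talk.

```python
import hashlib

# Hypothetical cell names; a real deployment would load these from configuration.
CELLS = ["cell-a", "cell-b", "cell-c"]


def cell_for(customer_id: str, cells: list[str] = CELLS) -> str:
    """Map a customer to one cell using a stable hash of the customer ID."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    # Use the first 8 bytes as an integer so the mapping is stable across
    # processes (Python's built-in hash() is salted per process for strings).
    index = int.from_bytes(digest[:8], "big") % len(cells)
    return cells[index]


def blast_radius(customer_ids: list[str], failed_cell: str) -> float:
    """Fraction of customers affected when a single cell fails."""
    if not customer_ids:
        return 0.0
    affected = sum(1 for cid in customer_ids if cell_for(cid) == failed_cell)
    return affected / len(customer_ids)
```

With three cells, the loss of one cell touches roughly a third of customers instead of all of them. Modulo hashing reshuffles customers whenever the number of cells changes, so a production router would more likely keep an explicit customer-to-cell mapping; the hash here only keeps the sketch short.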
- Observability:
  - Granular metrics and dimensions to detect issues early
  - Composite alarms to provide a holistic view of application health
  - Proactive operational reviews to continuously improve
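The idea behind a composite alarm can be sketched as a boolean rule over several granular alarms. This is plain Python, not the CloudWatch API, and the metric names, thresholds, and the rule itself are illustrative assumptions and not values from the talk.

```python
from dataclasses import dataclass


@dataclass
class Alarm:
    """A single granular alarm: one metric compared against one threshold."""
    name: str
    value: float
    threshold: float

    def in_alarm(self) -> bool:
        return self.value > self.threshold


def application_unhealthy(error_rate: Alarm, p99_latency: Alarm, az_error_skew: Alarm) -> bool:
    """Hypothetical composite rule.

    Fire when errors AND latency are both elevated, OR when one availability
    zone's error rate diverges sharply from the others (a typical gray-failure
    signal that an aggregate metric can hide).
    """
    return (error_rate.in_alarm() and p99_latency.in_alarm()) or az_error_skew.in_alarm()
```

Combining signals this way is meant to cut noise from any single metric while still catching a problem confined to one zone, which is why the per-zone skew is allowed to fire on its own.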
- Recovery:
  - Static stability (pre-provisioned capacity, avoiding new runtime dependencies)
  - Retrying with exponential backoff and jitter
  - Routing around failures using deep health checks
  - Application Recovery Controller (ARC) for regional and zonal traffic shifting
  - Aligning deployment and recovery actions
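Retrying with exponential backoff and jitter can be sketched as follows. This uses the "full jitter" variant, where each delay is drawn uniformly between zero and an exponentially growing, capped ceiling. The function names, the default `base`, `cap`, and `max_attempts` values, and the injectable `sleep` parameter are assumptions made for illustration.

```python
import random
import time


def backoff_delay(attempt: int, base: float = 0.1, cap: float = 5.0) -> float:
    """Full-jitter delay: a random value in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


def call_with_retries(operation, max_attempts: int = 5, sleep=time.sleep):
    """Call `operation` until it succeeds or `max_attempts` calls have failed.

    For brevity this retries on any exception; real code should retry only
    errors known to be transient (timeouts, throttling) and fail fast otherwise.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the last error to the caller.
            sleep(backoff_delay(attempt))
```

The jitter spreads retries from many clients over time so they do not arrive in synchronized waves, and the cap bounds the worst-case wait. Passing `sleep` as a parameter lets the function be tested without real delays.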
Capital One's Resilience Approaches
- Authorization Platform:
  - Zonal-independent cell architecture for low-latency, high-throughput transactions
  - Deep health checks and a resilience engine to monitor and route traffic
  - Fallback to degraded mode to maintain customer experience
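A deep health check probes the dependencies a cell needs, not just whether the process is running, and a degraded state lets the cell keep serving when only a non-critical dependency is down. The sketch below is an illustration under assumptions: the three-state `Health` enum, the split into critical and optional probes, and the probe names are hypothetical and do not describe Capital One's resilience engine.

```python
from enum import Enum
from typing import Callable, Dict


class Health(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"


def deep_health_check(
    critical: Dict[str, Callable[[], bool]],
    optional: Dict[str, Callable[[], bool]],
) -> Health:
    """Probe dependencies instead of only checking that the process is up."""

    def passes(probe: Callable[[], bool]) -> bool:
        try:
            return bool(probe())
        except Exception:
            return False  # A probe that raises counts as a failed dependency.

    if not all(passes(probe) for probe in critical.values()):
        return Health.UNHEALTHY  # Candidate for routing traffic away.
    if not all(passes(probe) for probe in optional.values()):
        return Health.DEGRADED  # Keep serving with reduced functionality.
    return Health.HEALTHY
```

A router could send traffic only to cells that report `HEALTHY` or `DEGRADED`. One design caution: if every cell shares a dependency and treats it as critical, a single outage can mark all cells unhealthy at once, so which probes count as critical deserves care.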
- Core Banking Platform:
  - Regional cell-based architecture for tenant isolation and dynamic scaling
  - Fault isolation to enable independent cell operations and prevent retry storms
  - Leveraging AWS services (Route 53, ELB) for native failover capabilities
Key Takeaways
- Resilience is a continuous journey, not a one-size-fits-all solution
- Deep observability and understanding of business workflows are crucial
- Aligning application design, deployment, and recovery actions is essential
- Capital One's examples showcase how resilience can be implemented in mission-critical financial applications