Building resilient applications on AWS with Capital One (ARC334)
Guiding Principles
"Everything fails, all the time" (quote from Amazon CTO Werner Vogels)
Failures cannot be legislated against; focus on fast detection and response (quote from an EC2 founder)
Resilience is not just about design and architecture; it is also about how the application and the team respond to failures
Failure Scenarios
Hard failures (e.g., AZ outage) are easy to detect and recover from using managed services
Gray failures (e.g., application-level issues) are harder to detect and recover from, and require better observability
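The distinction between the two scenarios can be sketched as a comparison between what the system reports about itself and what its clients actually observe. This is a minimal illustration, not something shown in the talk: the function name, the labels, and the 99% target are all assumptions.

```python
def classify_failure(health_check_passing: bool,
                     client_success_rate: float,
                     target_success_rate: float = 0.99) -> str:
    """Compare the system's own health signal with the client-side view.

    All names and the 0.99 target are hypothetical, chosen for illustration.
    """
    if not health_check_passing:
        # The system itself reports the problem: easy to detect.
        return "hard failure"
    if client_success_rate < target_success_rate:
        # Health checks pass, yet clients see errors: the gray-failure case.
        return "gray failure"
    return "healthy"


print(classify_failure(False, 0.0))    # hard failure
print(classify_failure(True, 0.90))    # gray failure
print(classify_failure(True, 0.999))   # healthy
```

The middle case is the one that needs better observability: nothing in the system's own health signal reveals it.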
Key Focus Areas for Resilience
Fault Isolation:
Physical boundaries (availability zones, regions, global services)
Logical boundaries (microservices, AWS accounts, "cellular" architecture)
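One way to picture a logical boundary such as a cell is a router that pins each tenant to exactly one cell, so a failing cell only affects the tenants assigned to it. The sketch below assumes a fixed list of cells and a simple hash-based assignment; the cell names and helper functions are hypothetical and are not taken from the talk.

```python
import hashlib

# Hypothetical cell names; a real deployment would discover these.
CELLS = ["cell-a", "cell-b", "cell-c", "cell-d"]


def cell_for(tenant_id: str, cells=CELLS) -> str:
    """Deterministically map a tenant to one cell using a stable hash."""
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(cells)
    return cells[index]


def affected_tenants(tenants, failed_cell, cells=CELLS):
    """Return only the tenants whose traffic lands on the failed cell."""
    return [t for t in tenants if cell_for(t, cells) == failed_cell]


tenants = [f"tenant-{i}" for i in range(1000)]
blast_radius = affected_tenants(tenants, "cell-a")
print(f"{len(blast_radius)} of {len(tenants)} tenants are in the failed cell")
```

Plain modulo hashing reassigns many tenants whenever the number of cells changes, so a real router would more likely keep an explicit tenant-to-cell mapping table.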
Observability:
Granular metrics and dimensions to detect issues early
Composite alarms to provide a holistic view of application health
Proactive operational reviews to continuously improve
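To see why granular dimensions matter, consider an error-rate metric split by Availability Zone. The sketch below is a plain-Python stand-in for per-dimension alarms combined into one composite verdict; it does not call CloudWatch, and the `Sample` type, the state names, and the 5% threshold are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    az: str        # dimension: Availability Zone
    requests: int
    errors: int


def error_rate(requests: int, errors: int) -> float:
    return errors / requests if requests else 0.0


def evaluate(samples, threshold: float = 0.05) -> dict:
    """Combine per-AZ alarms into a single composite health verdict."""
    breaching = [s.az for s in samples
                 if error_rate(s.requests, s.errors) > threshold]
    total_requests = sum(s.requests for s in samples)
    total_errors = sum(s.errors for s in samples)
    aggregate_alarm = error_rate(total_requests, total_errors) > threshold
    if len(breaching) == 1:
        state = "ZONAL_IMPAIRMENT"      # one AZ is an outlier
    elif len(breaching) > 1:
        state = "REGIONAL_IMPAIRMENT"   # several AZs are breaching
    else:
        state = "OK"
    return {"aggregate_alarm": aggregate_alarm,
            "breaching_azs": breaching,
            "state": state}


samples = [Sample("az1", 1000, 200),   # 20% errors
           Sample("az2", 5000, 10),    # 0.2% errors
           Sample("az3", 5000, 10)]    # 0.2% errors
print(evaluate(samples))
```

In this example the aggregate error rate is 220 / 11000 = 2%, below the threshold, so a single service-wide alarm stays quiet while the per-AZ view flags `az1`.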
Recovery:
Static stability (pre-provisioned capacity, avoiding new runtime dependencies)
Retrying with exponential backoff and jitter
Routing around failures using deep health checks
Application Recovery Controller (ARC) for regional and zonal traffic shifting
Aligning deployment and recovery actions
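Of the recovery techniques listed, retrying with exponential backoff and jitter is the easiest to show in a few lines. The following is a minimal sketch using the "full jitter" variant, where each delay is drawn uniformly between zero and a capped exponential ceiling. `TransientError`, the default base and cap, and the injectable `sleep` and `rng` parameters are assumptions for illustration, not details from the talk.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def backoff_delay(attempt, base=0.1, cap=5.0, rng=random.random):
    """Full jitter: scale a capped exponential ceiling by a random factor."""
    return rng() * min(cap, base * (2 ** attempt))


def call_with_retries(operation, max_attempts=5, base=0.1, cap=5.0,
                      sleep=time.sleep, rng=random.random):
    """Call operation(), retrying TransientError with jittered backoff."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # retries exhausted: surface the failure to the caller
            sleep(backoff_delay(attempt, base, cap, rng))


# Usage: an operation that fails twice, then succeeds on the third call.
attempts = {"count": 0}

def flaky():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TransientError("temporary failure")
    return "ok"

print(call_with_retries(flaky, base=0.01))   # ok
```

The jitter spreads retries from many clients over time so they do not arrive in synchronized waves, and the cap on attempts keeps a persistent failure from turning into unbounded retry traffic.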
Capital One's Resilience Approaches
Authorization Platform:
Zonal-independent cell architecture for low-latency, high-throughput transactions
Deep health checks and a resilience engine to monitor and route traffic
Fallback to degraded mode to maintain customer experience
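The interplay of deep health checks, routing, and a degraded-mode fallback can be sketched as follows. This is not Capital One's implementation, which the summary does not describe in detail. The probe names, the `authorize` function, and the rule "approve only small amounts when no cell is healthy" are invented stand-ins, chosen only to make the control flow concrete.

```python
from typing import Callable, Dict, Optional

Probe = Callable[[], bool]


def deep_health_check(probes: Dict[str, Probe]) -> bool:
    """Healthy only if every dependency probe succeeds.

    A probe that raises is treated the same as a probe that returns False.
    """
    for probe in probes.values():
        try:
            if not probe():
                return False
        except Exception:
            return False
    return True


def pick_cell(cells: Dict[str, Dict[str, Probe]]) -> Optional[str]:
    """Return the first cell whose deep health check passes, or None."""
    for name, probes in cells.items():
        if deep_health_check(probes):
            return name
    return None


def authorize(amount: float, cells, degraded_limit: float = 50.0) -> dict:
    cell = pick_cell(cells)
    if cell is not None:
        # Placeholder for the full decision logic running in a healthy cell.
        return {"mode": "full", "cell": cell, "approved": True}
    # Hypothetical degraded rule: keep serving, but only low-risk requests.
    return {"mode": "degraded", "cell": None,
            "approved": amount <= degraded_limit}


cells = {
    "cell-1": {"database": lambda: True, "cache": lambda: False},
    "cell-2": {"database": lambda: True, "cache": lambda: True},
}
print(authorize(120.0, cells))   # routed to cell-2 in full mode
```

Unlike a shallow check that only confirms the process is running, the deep check exercises the dependencies a request needs, which is what allows the router to steer around a cell that is up but not working.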
Core Banking Platform:
Regional cell-based architecture for tenant isolation and dynamic scaling
Fault isolation to enable independent cell operations and prevent retry storms
Leveraging AWS services (Route 53, ELB) for native failover capabilities
Key Takeaways
Resilience is a continuous journey, not a one-size-fits-all solution
Deep observability and understanding of business workflows are crucial
Aligning application design, deployment, and recovery actions is essential
Capital One's examples showcase how resilience can be implemented in mission-critical financial applications