Building resilient applications on AWS with Capital One (ARC334)

Guiding Principles

  • "Everything fails, all the time" (Werner Vogels, Amazon CTO)
  • Failure cannot be legislated against; focus instead on fast detection and response (attributed to an EC2 founder)
  • Resilience is not just about design and architecture; it is also about how the application and the team respond to failures

Failure Scenarios

  • Hard failures (e.g., an AZ outage) are easy to detect, and managed services make recovery straightforward
  • Gray failures (e.g., application-level issues) are harder to detect and recover from, and they require better observability
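
As a rough illustration of why gray failures call for granular observability, the sketch below compares an aggregate error rate with per-zone error rates. The function names, the sample counts, and the 5% threshold are assumptions made for this example, not details from the talk.

```python
from typing import Dict, List, Tuple

# Maps a zone name to (error_count, request_count). Hypothetical data shape.
ZoneCounts = Dict[str, Tuple[int, int]]


def overall_error_rate(counts: ZoneCounts) -> float:
    """Error rate across all zones combined."""
    total_errors = sum(errors for errors, _ in counts.values())
    total_requests = sum(requests for _, requests in counts.values())
    return total_errors / total_requests if total_requests else 0.0


def find_gray_zones(counts: ZoneCounts, threshold: float = 0.05) -> List[str]:
    """Return the zones whose own error rate exceeds the threshold."""
    flagged = []
    for zone, (errors, requests) in sorted(counts.items()):
        if requests > 0 and errors / requests > threshold:
            flagged.append(zone)
    return flagged


# One zone is degraded, but the aggregate stays under the alarm threshold.
sample = {"az-a": (5, 1000), "az-b": (4, 1000), "az-c": (90, 1000)}
```

With this sample, the aggregate rate stays below the threshold while `find_gray_zones(sample)` still flags `az-c`, which is the kind of issue an application-wide metric can hide.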

Key Focus Areas for Resilience

  1. Fault Isolation:

    • Physical boundaries (availability zones, regions, global services)
    • Logical boundaries (microservices, AWS accounts, "cellular" architecture)
  2. Observability:

    • Granular metrics and dimensions to detect issues early
    • Composite alarms to provide a holistic view of application health
    • Proactive operational reviews to continuously improve
  3. Recovery:

    • Static stability (pre-provisioned capacity, avoiding new runtime dependencies)
    • Retrying with exponential backoff and jitter
    • Routing around failures using deep health checks
    • Amazon Route 53 Application Recovery Controller (ARC) for regional and zonal traffic shifting
    • Aligning deployment and recovery actions
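
The retry guidance above can be sketched as follows. This is a minimal example of exponential backoff with "full jitter", where each delay is drawn uniformly between zero and a capped exponential bound. The function names, the default base and cap values, and the injectable `sleep` and `rng` parameters are assumptions chosen for illustration.

```python
import random
import time
from typing import Callable, Iterator, TypeVar

T = TypeVar("T")


def backoff_delays(max_attempts: int, base: float = 0.1, cap: float = 5.0,
                   rng: Callable[[], float] = random.random) -> Iterator[float]:
    """Yield one delay per retry: uniform in [0, min(cap, base * 2**attempt)]."""
    for attempt in range(max_attempts - 1):
        yield rng() * min(cap, base * 2 ** attempt)


def retry(operation: Callable[[], T], max_attempts: int = 5, base: float = 0.1,
          cap: float = 5.0, sleep: Callable[[float], None] = time.sleep) -> T:
    """Call operation until it succeeds or max_attempts calls have failed."""
    delays = backoff_delays(max_attempts, base, cap)
    while True:
        try:
            return operation()
        except Exception:
            delay = next(delays, None)
            if delay is None:
                raise  # Attempts exhausted: surface the last error.
            sleep(delay)
```

The jitter spreads retries from many clients over time so they do not arrive in synchronized waves, and the cap on attempts keeps a persistent failure from turning into unbounded retry traffic. A production version would normally retry only on errors known to be transient.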

Capital One's Resilience Approaches

  1. Authorization Platform:

    • Zonally independent cell architecture for low-latency, high-throughput transactions
    • Deep health checks and a resilience engine to monitor and route traffic
    • Fallback to degraded mode to maintain customer experience
  2. Core Banking Platform:

    • Regional cell-based architecture for tenant isolation and dynamic scaling
    • Fault isolation to enable independent cell operations and prevent retry storms
    • Leveraging AWS services (Route 53, ELB) for native failover capabilities
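
To make the idea of tenant isolation in a cell-based architecture more concrete, here is a minimal sketch of a routing function that pins each tenant to one cell. The summary does not describe how Capital One assigns tenants to cells, so the hash-based scheme, the `cell_for_tenant` name, and the cell names are assumptions for illustration only.

```python
import hashlib
from typing import Sequence


def cell_for_tenant(tenant_id: str, cells: Sequence[str]) -> str:
    """Deterministically map a tenant to one cell (hypothetical scheme)."""
    if not cells:
        raise ValueError("at least one cell is required")
    # A stable hash gives the same result on every host and every run,
    # unlike Python's built-in hash(), which is salted per process.
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(cells)
    return cells[index]


cells = ["cell-1", "cell-2", "cell-3"]
assigned = cell_for_tenant("tenant-42", cells)
```

Because a tenant always lands in the same cell, a failure or a retry surge in one cell stays confined to the tenants mapped there. Simple modulo hashing reassigns many tenants when the number of cells changes, so a real system would more likely keep an explicit tenant-to-cell mapping.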

Key Takeaways

  • Resilience is a continuous journey, not a one-size-fits-all solution
  • Deep observability and understanding of business workflows are crucial
  • Aligning application design, deployment, and recovery actions is essential
  • Capital One's examples showcase how resilience can be implemented in mission-critical financial applications
