AWS re:Invent 2025 - Build resilient SaaS: multi-account resilience testing patterns (ISV404)

Building Resilient SaaS: Multi-Account Resilience Testing Patterns

Understanding SaaS Architecture Fundamentals

  • SaaS providers must adopt 6 critical pillars for building resilient SaaS architectures:
    1. Robust tenant isolation
    2. Noisy neighbor mitigation
    3. Comprehensive identity and access management
    4. Tenant-aware observability
    5. Strategic tiering
    6. Cost-aware tracking mechanism
  • These fundamentals align with the AWS Well-Architected Reliability pillar, which emphasizes testing reliability through failure management.

Resilience Life Cycle Framework

  • AWS recommends a 5-stage resilience life cycle framework:
    1. Set objectives: Understand required resiliency levels (RTO, RPO, availability SLAs)
    2. Design and implement: Anticipate failure modes and adopt appropriate tools
    3. Evaluate and test: Perform resilience testing with controlled fault injection
    4. Operate: Instrument observability tools, logs, and metrics
    5. Respond and learn: Establish mitigation strategies and continuously improve
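The objectives in stage 1 translate into concrete numbers. As a minimal sketch (the function names are illustrative and not from the talk), an availability SLA can be converted into a downtime budget, and an observed recovery time can be checked against an RTO. For example, 99.9% availability over a 30-day period leaves about 43.2 minutes of allowed downtime.

```python
from datetime import timedelta


def downtime_budget(availability_percent: float, period: timedelta) -> timedelta:
    """Return the maximum downtime allowed in `period` for an availability target."""
    if not 0 < availability_percent <= 100:
        raise ValueError("availability_percent must be in (0, 100]")
    unavailable_fraction = 1 - availability_percent / 100
    return timedelta(seconds=period.total_seconds() * unavailable_fraction)


def meets_rto(observed_recovery: timedelta, rto: timedelta) -> bool:
    """True when the recovery time measured in an experiment is within the RTO."""
    return observed_recovery <= rto


if __name__ == "__main__":
    month = timedelta(days=30)
    for target in (99.0, 99.9, 99.99):
        print(f"{target}% over 30 days -> {downtime_budget(target, month)} of downtime")
```

Numbers like these give the later stages something to test against: an experiment that shows recovery taking longer than the RTO is a finding, not just an observation.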

Resilience Experiment Methodology

  • Establish a steady state baseline to understand normal system behavior
  • Define a hypothesis about potential weaknesses or assumptions to validate
  • Run controlled experiments using AWS Fault Injection Service (FIS) to induce faults
  • Evaluate the results and make improvements to architecture, runbooks, or procedures
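The four steps above can be sketched as a small harness. This is an illustration under assumptions, not the speakers' tooling: `SteadyState`, `Metrics`, and `run_experiment` are hypothetical names, and the callables stand in for whatever reads metrics and starts or stops a fault (for example, an FIS experiment).

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SteadyState:
    """Thresholds that describe normal behavior (hypothetical example metrics)."""
    max_error_rate: float
    max_p99_latency_ms: float


@dataclass
class Metrics:
    error_rate: float
    p99_latency_ms: float


def within_steady_state(metrics: Metrics, steady: SteadyState) -> bool:
    return (metrics.error_rate <= steady.max_error_rate
            and metrics.p99_latency_ms <= steady.max_p99_latency_ms)


def run_experiment(hypothesis: str, steady: SteadyState,
                   read_metrics: Callable[[], Metrics],
                   inject_fault: Callable[[], None],
                   stop_fault: Callable[[], None]) -> dict:
    """Baseline check, fault injection, then evaluation of the hypothesis."""
    baseline = read_metrics()
    if not within_steady_state(baseline, steady):
        # Never inject faults into a system that is already unhealthy.
        return {"hypothesis": hypothesis, "status": "skipped",
                "reason": "system not in steady state"}
    inject_fault()
    try:
        during = read_metrics()
    finally:
        stop_fault()  # always remove the fault, even if reading metrics fails
    held = within_steady_state(during, steady)
    return {"hypothesis": hypothesis,
            "status": "confirmed" if held else "refuted",
            "baseline": baseline, "during": during}
```

A refuted hypothesis is the useful outcome here: it points at the architecture, runbook, or procedure that needs to change before the next iteration.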

AWS Fault Injection Service (FIS)

  • FIS is a fully managed service for running fault injection experiments, which removes the overhead of building and maintaining custom fault injection scripts
  • Configure experiment templates with actions, targets, and safeguards to control the blast radius
  • Leverage pre-built scenario libraries or create custom experiments
  • Integrate with third-party observability tools via Amazon EventBridge
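To make the template structure concrete, here is a minimal sketch of an experiment template with one target, one action, and a CloudWatch alarm as the safeguard. The field names follow the FIS `CreateExperimentTemplate` request shape as I understand it; the ARNs, tag values, and the names `tagged-instances` and `stop-one-instance` are placeholders chosen for illustration. The block only builds the request, so it runs without AWS credentials.

```python
import uuid


def build_experiment_template(role_arn: str, alarm_arn: str,
                              tag_key: str, tag_value: str) -> dict:
    """Build a request body for an FIS experiment template (illustrative values)."""
    return {
        "clientToken": str(uuid.uuid4()),
        "description": "Stop one tagged EC2 instance, restart it after 5 minutes",
        "roleArn": role_arn,  # role that FIS assumes to perform the actions
        # Safeguard: FIS stops the experiment if this alarm goes into ALARM state.
        "stopConditions": [{"source": "aws:cloudwatch:alarm", "value": alarm_arn}],
        # Blast radius: only instances carrying the tag, and only one of them.
        "targets": {
            "tagged-instances": {
                "resourceType": "aws:ec2:instance",
                "resourceTags": {tag_key: tag_value},
                "selectionMode": "COUNT(1)",
            }
        },
        "actions": {
            "stop-one-instance": {
                "actionId": "aws:ec2:stop-instances",
                "parameters": {"startInstancesAfterDuration": "PT5M"},
                "targets": {"Instances": "tagged-instances"},
            }
        },
        "tags": {"Purpose": "resilience-testing"},
    }


if __name__ == "__main__":
    template = build_experiment_template(
        role_arn="arn:aws:iam::111122223333:role/FisExperimentRole",      # placeholder
        alarm_arn="arn:aws:cloudwatch:us-east-1:111122223333:alarm:HighErrorRate",  # placeholder
        tag_key="FisTarget", tag_value="true",
    )
    print(sorted(template))
    # With boto3 installed and credentials configured, the request could be sent with:
    #   boto3.client("fis").create_experiment_template(**template)
```

Scoping by tag plus `COUNT(1)` and attaching a stop condition are the two levers that keep the blast radius small.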

SaaS Reference Architecture

  • SaaS providers should separate the control plane (tenant management, billing, authentication) from the application plane (tenant-specific services)
  • Example SaaS offerings:
    1. SaaS e-commerce solution with tenant-specific product and order management
    2. SaaS retrieval-augmented generation with tenant-isolated data and LLM access

Resilience Testing Patterns

Pattern 1: Multi-Tenant Noisy Neighbor

  • Hypothesis: Tenant throttling limits continue to hold when injected faults disable an Amazon EventBridge rule and delete CloudWatch log streams
  • Observed that tenants could bypass throttling limits after faults were injected, highlighting the need for:
    • IAM least-privilege mechanisms to restrict actions that could disable monitoring
    • Broader mitigation strategies for noisy neighbor scenarios beyond just tenant-level throttling
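One way to read the least-privilege finding is as an explicit deny on the actions that switch monitoring off. The sketch below builds such an IAM policy document. The action list and the statement ID are my assumptions about what "disabling monitoring" covers in this scenario, not a list given in the talk.

```python
import json

# IAM actions that can silence EventBridge rules or remove CloudWatch Logs data.
# Assumed list for illustration; extend it to match your own monitoring setup.
MONITORING_TAMPERING_ACTIONS = [
    "events:DisableRule",
    "events:DeleteRule",
    "events:RemoveTargets",
    "logs:DeleteLogStream",
    "logs:DeleteLogGroup",
]


def monitoring_guardrail_policy() -> dict:
    """Return an IAM policy document that denies monitoring-disabling actions."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyDisablingMonitoring",
                "Effect": "Deny",
                "Action": list(MONITORING_TAMPERING_ACTIONS),
                "Resource": "*",
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(monitoring_guardrail_policy(), indent=2))
```

A policy like this would be attached to workload and tenant-facing roles. The role that FIS itself assumes needs to stay outside the deny if the same fault is to be re-injected in later test runs.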

Pattern 2: Tenant Isolation

  • Hypothesis: Even with a bug injected into a microservice, one tenant cannot access another tenant's data
  • Leveraged attribute-based access control (ABAC) to dynamically generate tenant-scoped credentials, validating the effectiveness of the isolation strategy
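A common way to implement tenant-scoped credentials with ABAC is to pass the tenant ID as an STS session tag and reference it from the role's policy. The sketch below assumes tenant data lives in a DynamoDB table partitioned by tenant ID, which the talk does not state; the tag key `TenantID`, the role ARN, and the table ARN are hypothetical.

```python
def tenant_scoped_policy(table_arn: str) -> dict:
    """Role policy: allow access only to items whose partition key equals the
    TenantID session tag of the caller."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["dynamodb:GetItem", "dynamodb:Query", "dynamodb:PutItem"],
                "Resource": table_arn,
                "Condition": {
                    "ForAllValues:StringEquals": {
                        "dynamodb:LeadingKeys": ["${aws:PrincipalTag/TenantID}"]
                    }
                },
            }
        ],
    }


def assume_role_request(role_arn: str, tenant_id: str) -> dict:
    """Keyword arguments for STS AssumeRole that tag the session with the tenant.
    The role's trust policy must also allow sts:TagSession."""
    return {
        "RoleArn": role_arn,
        "RoleSessionName": f"tenant-{tenant_id}",
        "Tags": [{"Key": "TenantID", "Value": tenant_id}],
        "DurationSeconds": 900,  # short-lived credentials
    }


if __name__ == "__main__":
    request = assume_role_request(
        "arn:aws:iam::111122223333:role/TenantDataAccess", "tenant-a")  # placeholders
    print(request)
    # With boto3 and credentials available:
    #   creds = boto3.client("sts").assume_role(**request)["Credentials"]
```

Because the tenant boundary is enforced by the credentials rather than by application code, a bug in a microservice's query logic should result in an access-denied error instead of a cross-tenant read, which is the behavior the experiment is designed to confirm.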

Pattern 3: Serverless Application Resilience

  • Inject latency into AWS Lambda invocations and observe the impact on user experience, API Gateway error rates, and other system dependencies
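As a hedged sketch, the target and action portion of an FIS template for Lambda latency might look like the following. The action ID `aws:lambda:invocation-add-delay` and its parameters reflect my understanding of the FIS Lambda actions and should be checked against the FIS actions reference; as far as I know, these actions also require the FIS extension to be configured on the function. The function ARN and the key names `tenant-functions` and `add-latency` are placeholders.

```python
def lambda_latency_fragment(function_arn: str, delay_ms: int,
                            invocation_percentage: int,
                            duration: str = "PT10M") -> dict:
    """Targets and actions for adding startup delay to a share of invocations."""
    if not 0 < invocation_percentage <= 100:
        raise ValueError("invocation_percentage must be in (0, 100]")
    return {
        "targets": {
            "tenant-functions": {
                "resourceType": "aws:lambda:function",
                "resourceArns": [function_arn],
                "selectionMode": "ALL",
            }
        },
        "actions": {
            "add-latency": {
                "actionId": "aws:lambda:invocation-add-delay",
                # FIS action parameters are passed as strings.
                "parameters": {
                    "duration": duration,
                    "invocationPercentage": str(invocation_percentage),
                    "startupDelayMilliseconds": str(delay_ms),
                },
                "targets": {"Functions": "tenant-functions"},
            }
        },
    }


if __name__ == "__main__":
    fragment = lambda_latency_fragment(
        "arn:aws:lambda:us-east-1:111122223333:function:orders-api",  # placeholder
        delay_ms=2000, invocation_percentage=50)
    print(fragment["actions"]["add-latency"]["parameters"])
```

Starting with a partial invocation percentage keeps some traffic healthy, which makes it easier to compare affected and unaffected requests in API Gateway metrics.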

Pattern 4: EKS High Availability

  • Use FIS to inject faults such as terminating pods and inducing CPU or I/O stress on EKS clusters to validate resilience
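The pod-level faults mentioned above could be expressed as a template fragment along these lines. The action IDs (`aws:eks:pod-cpu-stress`, `aws:eks:pod-delete`) and target parameters follow the FIS EKS pod actions as I understand them; the cluster ARN, namespace, label selector, and service account name are placeholders, and the two-step sequence is my own choice for illustration.

```python
def eks_pod_fault_fragment(cluster_arn: str, namespace: str,
                           label_selector: str, service_account: str) -> dict:
    """Targets and actions: stress CPU on one pod, then delete it."""
    return {
        "targets": {
            "app-pods": {
                "resourceType": "aws:eks:pod",
                "parameters": {
                    "clusterIdentifier": cluster_arn,
                    "namespace": namespace,
                    "selectorType": "labelSelector",
                    "selectorValue": label_selector,
                },
                "selectionMode": "COUNT(1)",
            }
        },
        "actions": {
            "cpu-stress": {
                "actionId": "aws:eks:pod-cpu-stress",
                "parameters": {
                    "duration": "PT5M",
                    "percent": "80",
                    "kubernetesServiceAccount": service_account,
                },
                "targets": {"Pods": "app-pods"},
            },
            "delete-pod": {
                "actionId": "aws:eks:pod-delete",
                "parameters": {"kubernetesServiceAccount": service_account},
                "targets": {"Pods": "app-pods"},
                # Run only after the stress action has finished.
                "startAfter": ["cpu-stress"],
            },
        },
    }


if __name__ == "__main__":
    fragment = eks_pod_fault_fragment(
        "arn:aws:eks:us-east-1:111122223333:cluster/saas-app",  # placeholder
        namespace="tenant-a", label_selector="app=orders",
        service_account="fis-experiment")
    print(list(fragment["actions"]))
```

Selecting pods by namespace and label keeps the fault scoped to one tenant or one service, so the experiment can check whether replicas and autoscaling absorb the loss without affecting other tenants.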

Key Takeaways

  • Understand your end-to-end workload architecture to identify key dependencies and weaknesses
  • Define clear objectives and hypotheses for resilience testing to align with business outcomes
  • Adopt a continuous, iterative approach to resilience testing as part of your CI/CD pipeline
  • Leverage AWS Fault Injection Service to inject controlled faults across multi-account setups
  • Continuously improve your resilience posture based on learnings from experiments
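For the multi-account takeaway, FIS lets an experiment template in an orchestrator account target resources in other accounts by registering a target account configuration per account. The sketch below builds the experiment options and the per-account requests as plain dictionaries. The field names follow the FIS API as I understand it; the template ID, account IDs, and role ARNs are placeholders.

```python
import re
import uuid


def multi_account_options() -> dict:
    """Value for the `experimentOptions` field of an experiment template."""
    return {
        "accountTargeting": "multi-account",
        # Fail the experiment if a target resolves to no resources.
        "emptyTargetResolutionMode": "fail",
    }


def target_account_requests(template_id: str,
                            account_role_arns: dict[str, str]) -> list[dict]:
    """One CreateTargetAccountConfiguration request per target account."""
    requests = []
    for account_id, role_arn in sorted(account_role_arns.items()):
        if not re.fullmatch(r"\d{12}", account_id):
            raise ValueError(f"not a 12-digit AWS account ID: {account_id!r}")
        requests.append({
            "clientToken": str(uuid.uuid4()),
            "experimentTemplateId": template_id,
            "accountId": account_id,
            "roleArn": role_arn,  # role in the target account that FIS assumes
            "description": f"Resilience testing target {account_id}",
        })
    return requests


if __name__ == "__main__":
    reqs = target_account_requests(
        "EXT-placeholder-id",  # hypothetical template ID
        {
            "111122223333": "arn:aws:iam::111122223333:role/FisTargetRole",
            "444455556666": "arn:aws:iam::444455556666:role/FisTargetRole",
        },
    )
    for req in reqs:
        print(req["accountId"], req["roleArn"])
    # With boto3 and credentials available, each request could be sent with:
    #   boto3.client("fis").create_target_account_configuration(**req)
```

Generating these requests from a list of tenant accounts makes it straightforward to run the same experiment definition from a CI/CD pipeline as accounts are added or removed.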
