Building Resilient SaaS: Multi-Account Resilience Testing Patterns
Understanding SaaS Architecture Fundamentals
SaaS providers should build resilient architectures on six critical pillars:
Robust tenant isolation
Noisy neighbor mitigation
Comprehensive identity and access management
Tenant-aware observability
Strategic tiering
Cost-aware tracking mechanisms
These fundamentals align with the AWS Well-Architected Reliability pillar, which emphasizes testing reliability through failure management.
Resilience Life Cycle Framework
AWS recommends a 5-stage resilience life cycle framework:
Set objectives: Understand required resiliency levels (RTO, RPO, availability SLAs)
Design and implement: Anticipate failure modes and adopt appropriate tools
Evaluate and test: Perform resilience testing with controlled fault injection
Operate: Instrument observability tools, logs, and metrics
Respond and learn: Establish mitigation strategies and continuously improve
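To make the "Set objectives" stage concrete, an availability SLA can be converted into a downtime budget. This is a minimal sketch assuming a 30-day period; the function name and the sample targets are illustrative and do not come from the session.

```python
def downtime_budget_minutes(availability_pct: float, period_days: int = 30) -> float:
    """Return the minutes of downtime an availability target allows per period."""
    if not 0 < availability_pct <= 100:
        raise ValueError("availability_pct must be in (0, 100]")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


# Illustrative targets only; real objectives come from the business.
for target in (99.0, 99.9, 99.99):
    budget = downtime_budget_minutes(target)
    print(f"{target}% availability allows {budget:.2f} minutes of downtime per 30 days")
```

A budget like this gives the later stages something measurable: an experiment either keeps the workload inside the budget or it does not.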
Resilience Experiment Methodology
Establish a steady state baseline to understand normal system behavior
Define a hypothesis about potential weaknesses or assumptions to validate
Run controlled experiments using AWS Fault Injection Service (FIS) to induce faults
Evaluate the results and make improvements to architecture, runbooks, or procedures
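The four steps above can be sketched as a small harness. This is a local illustration only: the fault is simulated with an in-memory flag rather than FIS, and every name (Hypothesis, run_experiment, the toy measure function) is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Hypothesis:
    description: str
    max_error_rate: float  # tolerated fraction of failed requests while the fault is active


def error_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of failed requests; each outcome is True on success."""
    if not outcomes:
        return 0.0
    return sum(1 for ok in outcomes if not ok) / len(outcomes)


def run_experiment(
    hypothesis: Hypothesis,
    measure: Callable[[], Sequence[bool]],
    inject_fault: Callable[[], None],
    stop_fault: Callable[[], None],
) -> dict:
    baseline = error_rate(measure())      # step 1: steady state
    inject_fault()                        # step 3: controlled fault
    try:
        during = error_rate(measure())
    finally:
        stop_fault()                      # always roll the fault back
    return {                              # step 4: evaluate against the hypothesis
        "hypothesis": hypothesis.description,
        "baseline_error_rate": baseline,
        "fault_error_rate": during,
        "hypothesis_held": during <= hypothesis.max_error_rate,
    }


# Toy system: 10% of requests fail while the simulated fault is active.
state = {"fault_active": False}


def measure() -> list:
    failures = 10 if state["fault_active"] else 0
    return [False] * failures + [True] * (100 - failures)


result = run_experiment(
    Hypothesis("Error rate stays at or below 5% during the fault", 0.05),  # step 2
    measure,
    inject_fault=lambda: state.update(fault_active=True),
    stop_fault=lambda: state.update(fault_active=False),
)
print(result)
```

In this toy run the hypothesis does not hold, which is the useful outcome: it points at something to fix in the architecture or runbooks.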
AWS Fault Injection Service (FIS)
FIS is a fully managed service that eliminates the overhead of building and maintaining custom fault injection scripts
Configure experiment templates with actions, targets, and safeguards to control the blast radius
Leverage pre-built scenario libraries or create custom experiments
Integrate with third-party observability tools via Amazon EventBridge
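As a rough illustration of how actions, targets, and safeguards fit together, the sketch below builds an experiment template as a plain Python dictionary. The EC2 stop action is only an example and was not part of the session; the tag, ARNs, and function name are placeholders. The field names follow the FIS CreateExperimentTemplate request shape as I understand it, so check them against the current API reference before use.

```python
import json


def build_experiment_template(role_arn: str, alarm_arn: str) -> dict:
    """Build a request body for an FIS experiment template (illustrative values)."""
    return {
        "description": "Stop one tagged EC2 instance, restart it after 5 minutes",
        "roleArn": role_arn,  # role FIS assumes to run the experiment
        "targets": {
            "oneTaggedInstance": {
                "resourceType": "aws:ec2:instance",
                "resourceTags": {"ChaosReady": "true"},  # hypothetical opt-in tag
                "selectionMode": "COUNT(1)",  # limits the blast radius to one instance
            }
        },
        "actions": {
            "stopInstance": {
                "actionId": "aws:ec2:stop-instances",
                "parameters": {"startInstancesAfterDuration": "PT5M"},
                "targets": {"Instances": "oneTaggedInstance"},
            }
        },
        # Safeguard: FIS stops the experiment if this CloudWatch alarm fires.
        "stopConditions": [{"source": "aws:cloudwatch:alarm", "value": alarm_arn}],
        "tags": {"Purpose": "resilience-test"},
    }


template = build_experiment_template(
    role_arn="arn:aws:iam::111122223333:role/example-fis-role",  # placeholder
    alarm_arn="arn:aws:cloudwatch:us-east-1:111122223333:alarm:example-alarm",  # placeholder
)
print(json.dumps(template, indent=2))

# With boto3 installed and credentials configured, the dictionary could be passed on as:
#   boto3.client("fis").create_experiment_template(**template)
```

Keeping the template as data makes it easy to review the selection mode and stop conditions before anything runs.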
SaaS Reference Architecture
SaaS providers should separate the control plane (tenant management, billing, authentication) from the application plane (tenant-specific services)
Example SaaS offerings:
SaaS e-commerce solution with tenant-specific product and order management
SaaS retrieval-augmented generation with tenant-isolated data and LLM access
Resilience Testing Patterns
Pattern 1: Multi-Tenant Noisy Neighbor
Hypothesis: Tenant throttling limits still hold when faults disable an Amazon EventBridge rule and delete CloudWatch log streams
Observed that tenants could bypass throttling limits after faults were injected, highlighting the need for:
IAM least-privilege mechanisms to restrict actions that could disable monitoring
Broader mitigation strategies for noisy neighbor scenarios beyond just tenant-level throttling
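One way to express the least-privilege finding is an explicit deny on the actions that can switch monitoring off. The sketch below builds such a policy document in Python. The statement ID, the exempt role, and the choice of actions are assumptions for illustration; how the policy is attached (identity policy, permissions boundary, or service control policy) depends on the account setup.

```python
import json


def monitoring_guardrail_policy(exempt_role_arn: str) -> dict:
    """Deny tampering with EventBridge rules and CloudWatch Logs for all but one role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyMonitoringTampering",  # hypothetical statement ID
                "Effect": "Deny",
                "Action": [
                    "events:DisableRule",
                    "events:DeleteRule",
                    "logs:DeleteLogStream",
                    "logs:DeleteLogGroup",
                ],
                "Resource": "*",
                # Exempt a single operations role, for example the one used to run experiments.
                "Condition": {"ArnNotLike": {"aws:PrincipalArn": exempt_role_arn}},
            }
        ],
    }


policy = monitoring_guardrail_policy(
    "arn:aws:iam::111122223333:role/example-resilience-ops"  # placeholder ARN
)
print(json.dumps(policy, indent=2))
```

A guardrail like this covers only the monitoring path; it does not replace the broader noisy neighbor mitigations noted above.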
Pattern 2: Tenant Isolation
Hypothesis: Inject a microservice bug to test if a tenant can access another tenant's data
Leveraged attribute-based access control (ABAC) to dynamically generate tenant-scoped credentials, validating the effectiveness of the isolation strategy
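A common way to implement ABAC-based tenant scoping on AWS is to pass the tenant ID as an STS session tag and reference that tag in the role's policy. The sketch below assumes the tenant data lives in a DynamoDB table partitioned by tenant ID, which the session did not specify; the table, role, tag key, and function names are all hypothetical.

```python
def tenant_scoped_policy(table_arn: str) -> dict:
    """Policy that allows access only to items whose partition key equals the caller's TenantId tag."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["dynamodb:GetItem", "dynamodb:Query", "dynamodb:PutItem"],
                "Resource": table_arn,
                "Condition": {
                    "ForAllValues:StringEquals": {
                        # Resolved by IAM at request time from the session tag.
                        "dynamodb:LeadingKeys": ["${aws:PrincipalTag/TenantId}"]
                    }
                },
            }
        ],
    }


def assume_role_request(role_arn: str, tenant_id: str) -> dict:
    """Keyword arguments for sts.assume_role that tag the session with the tenant ID.

    The role's trust policy must also allow sts:TagSession for the tag to be accepted.
    """
    return {
        "RoleArn": role_arn,
        "RoleSessionName": f"tenant-{tenant_id}",
        "Tags": [{"Key": "TenantId", "Value": tenant_id}],
    }


policy = tenant_scoped_policy(
    "arn:aws:dynamodb:us-east-1:111122223333:table/example-orders"  # placeholder
)
request = assume_role_request(
    "arn:aws:iam::111122223333:role/example-tenant-access",  # placeholder
    tenant_id="tenant-a",
)
print(request)

# With boto3 installed and credentials configured:
#   creds = boto3.client("sts").assume_role(**request)["Credentials"]
```

Because the credentials carry the tenant tag, a bug in the microservice that asks for another tenant's partition key is rejected by IAM rather than by application code.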
Pattern 3: Serverless Application Resilience
Inject latency into AWS Lambda invocations and observe the impact on user experience, API Gateway error rates, and other system dependencies
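Before running the experiment, it can help to estimate how much added latency the callers tolerate. The sketch below is a back-of-the-envelope model, not FIS itself: it takes observed invocation latencies, adds a fixed delay, and reports the share of requests that would exceed a timeout. The sample latencies, the delay, and the timeout are invented for illustration.

```python
from typing import Sequence


def projected_timeout_rate(
    baseline_ms: Sequence[float], injected_delay_ms: float, timeout_ms: float
) -> float:
    """Fraction of requests that would exceed the timeout once the delay is added."""
    if not baseline_ms:
        raise ValueError("baseline_ms must contain at least one sample")
    timed_out = sum(1 for latency in baseline_ms if latency + injected_delay_ms > timeout_ms)
    return timed_out / len(baseline_ms)


# Hypothetical samples from the steady state baseline, in milliseconds.
samples = [120, 250, 400, 900, 1500]

for delay in (0, 2000, 2500):
    rate = projected_timeout_rate(samples, injected_delay_ms=delay, timeout_ms=3000)
    print(f"delay={delay} ms -> projected timeout rate {rate:.0%}")
```

The projection gives a hypothesis to test: if the measured error rate during the real fault is far from the estimate, retries, queuing, or a downstream dependency is behaving differently than assumed.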
Pattern 4: EKS High Availability
Use FIS to inject faults such as pod termination and CPU or I/O stress on EKS clusters to validate resilience
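A pod termination experiment might be described with a template along these lines. This is a sketch under assumptions: the action ID, target resource type, and parameter names reflect my reading of the FIS EKS pod actions and should be checked against the FIS actions reference; the cluster, namespace, label, service account, and ARNs are placeholders. The CPU and I/O stress actions take different parameters and are not shown.

```python
def eks_pod_delete_template(
    cluster_arn: str,
    namespace: str,
    app_label: str,
    role_arn: str,
    alarm_arn: str,
    service_account: str = "example-fis-sa",  # hypothetical Kubernetes service account
) -> dict:
    """Build an FIS template body that deletes half of the pods matching a label."""
    return {
        "description": f"Delete 50% of pods labelled app={app_label}",
        "roleArn": role_arn,
        "targets": {
            "appPods": {
                "resourceType": "aws:eks:pod",
                "selectionMode": "PERCENT(50)",
                "parameters": {
                    "clusterIdentifier": cluster_arn,
                    "namespace": namespace,
                    "selectorType": "labelSelector",
                    "selectorValue": f"app={app_label}",
                },
            }
        },
        "actions": {
            "deletePods": {
                "actionId": "aws:eks:pod-delete",
                "parameters": {"kubernetesServiceAccount": service_account},
                "targets": {"Pods": "appPods"},
            }
        },
        # Stop the experiment automatically if the availability alarm fires.
        "stopConditions": [{"source": "aws:cloudwatch:alarm", "value": alarm_arn}],
    }


template = eks_pod_delete_template(
    cluster_arn="arn:aws:eks:us-east-1:111122223333:cluster/example-cluster",
    namespace="orders",
    app_label="order-service",
    role_arn="arn:aws:iam::111122223333:role/example-fis-role",
    alarm_arn="arn:aws:cloudwatch:us-east-1:111122223333:alarm:example-availability",
)
print(template["targets"]["appPods"]["parameters"])
```

Selecting a percentage of pods rather than all of them keeps enough replicas alive to observe whether the deployment recovers while still serving traffic.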
Key Takeaways
Understand your end-to-end workload architecture to identify key dependencies and weaknesses
Define clear objectives and hypotheses for resilience testing to align with business outcomes
Adopt a continuous, iterative approach to resilience testing as part of your CI/CD pipeline
Use AWS Fault Injection Service to inject controlled faults across multiple accounts
Continuously improve your resilience posture based on learnings from experiments