AWS re:Invent 2025 - Resilience Testing and AWS Lambda Actions Under the Hood (COP414)
Introduction to Chaos Engineering and AWS Fault Injection Service (FIS)
Chaos engineering is an approach to building confidence in a system's ability to withstand failures and disruptions
It involves intentionally introducing faults and failures to validate system resilience
AWS Fault Injection Service (FIS) is a fully managed chaos engineering service that helps customers get started quickly
Key benefits of FIS:
Serverless and fully managed - no infrastructure to provision
Provides a library of pre-defined failure scenarios
Integrates natively with AWS services such as EC2, enabling actions such as pausing autoscaling
Provides controls and guardrails to introduce faults in a controlled manner
AWS Lambda and the Need for Resilience Testing
AWS Lambda is a serverless compute service that abstracts away infrastructure management
The Lambda service runs functions across multiple Availability Zones (AZs), so the compute layer itself is designed to tolerate the loss of a single AZ
However, Lambda functions often integrate with other systems, so overall system resilience still needs to be validated
Chaos engineering is a recommended approach to test the resilience of Lambda-based architectures
FIS Actions for Testing Lambda Function Resilience
FIS recently introduced three native actions for testing Lambda function resilience:
Add Start Delay: Delays the start of a Lambda function invocation
Modify Integration Response: Changes the response a Lambda function returns, to test how callers handle incorrect responses
Invocation Errors: Marks Lambda function invocations as failed, to test how callers handle errors
These actions leverage the concept of Lambda extensions to inject faults into the Lambda execution lifecycle
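To make the three fault types concrete, here is a minimal Python sketch of what each one does to a single invocation, written as a plain wrapper around a handler. This is an illustration only and not how the FIS extension is implemented: `FaultConfig`, `with_faults`, and the field names are hypothetical and do not correspond to FIS parameters.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, Optional

Handler = Callable[[dict, Any], Any]


@dataclass
class FaultConfig:
    """Hypothetical fault settings; field names are illustrative, not FIS parameters."""
    start_delay_seconds: float = 0.0
    fail_invocation: bool = False
    override_status_code: Optional[int] = None


def with_faults(handler: Handler, config: FaultConfig,
                sleep: Callable[[float], None] = time.sleep) -> Handler:
    """Wrap a handler so every invocation is subject to the configured faults."""
    def wrapped(event: dict, context: Any) -> Any:
        if config.start_delay_seconds > 0:
            # Start delay: the handler begins late, so callers see extra latency.
            sleep(config.start_delay_seconds)
        if config.fail_invocation:
            # Invocation error: the invocation fails before the handler runs.
            raise RuntimeError("injected invocation error")
        response = handler(event, context)
        if config.override_status_code is not None and isinstance(response, dict):
            # Modified integration response: the caller sees a different status code.
            response = {**response, "statusCode": config.override_status_code}
        return response
    return wrapped


def hello(event: dict, context: Any) -> dict:
    """Stand-in for a real Lambda handler."""
    return {"statusCode": 200, "body": "ok"}
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting; a caller exercising the delay case can substitute a function that only records the requested delay.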
How FIS Extensions Integrate with Lambda
FIS provides a custom extension that integrates with the Lambda runtime environment
The extension uses an API proxy pattern to hook into the Lambda function invocation lifecycle
When an experiment is configured in FIS, the extension polls an S3 bucket for the active fault configuration
The extension then applies the configured faults during Lambda function invocations
The extension uses a dual-mode polling mechanism (fast and slow) to balance performance and quick recovery
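The dual-mode polling idea can be sketched in a few lines of Python. This sketch assumes that the extension polls slowly while no fault configuration is present (to keep overhead low) and quickly while one is active (so faults stop soon after the experiment ends). The interval values and the names `next_poll_interval` and `simulate_polling` are assumptions for illustration, not values or identifiers from the talk.

```python
from typing import Callable, List, Optional

# Assumed intervals for illustration only; the summary does not state the real values.
SLOW_POLL_SECONDS = 60.0
FAST_POLL_SECONDS = 1.0


def next_poll_interval(active_fault: Optional[dict]) -> float:
    """Pick the wait before the next configuration check.

    Fast while a fault is active, slow while idle.
    """
    if active_fault is not None:
        return FAST_POLL_SECONDS
    return SLOW_POLL_SECONDS


def simulate_polling(fetch_config: Callable[[], Optional[dict]],
                     polls: int) -> List[float]:
    """Run a fixed number of polls and return the interval chosen after each one.

    `fetch_config` stands in for reading the fault configuration from S3.
    """
    intervals = []
    for _ in range(polls):
        intervals.append(next_poll_interval(fetch_config()))
    return intervals
```

In a real extension the loop would sleep for the chosen interval between checks; the simulation returns the intervals instead so the mode switching is easy to inspect.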
Demonstration and Walkthrough
The presenters demonstrated a sample application architecture using Lambda, DynamoDB, and API Gateway
They walked through the AWS CDK code used to deploy the application and set up the FIS experiments
Two experiment templates were created:
Inject a 2-second startup delay for 5 minutes on all tagged Lambda functions
Inject invocation errors on all tagged Lambda functions
The experiments were executed, and the presenters analyzed the impact on metrics captured in a CloudWatch dashboard
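The presenters defined their templates in AWS CDK, and that code is not reproduced here. As a rough equivalent, the first template (a 2-second start delay for 5 minutes on all tagged functions) might look like the following Python dictionary, shaped for the boto3 `create_experiment_template` call. The action ID, parameter names, and target key reflect my understanding of the FIS Lambda actions and should be checked against the FIS action reference. The role ARN and tag are placeholders, and `build_delay_template` is a hypothetical helper.

```python
def build_delay_template(role_arn: str, tag_key: str, tag_value: str,
                         delay_ms: int = 2000, duration: str = "PT5M") -> dict:
    """Build an experiment template that delays the start of tagged Lambda functions."""
    return {
        "description": "Add a start delay to all tagged Lambda functions",
        "roleArn": role_arn,  # placeholder: IAM role that FIS assumes
        # No stop condition in this sketch; a real experiment would normally
        # reference a CloudWatch alarm here as a guardrail.
        "stopConditions": [{"source": "none"}],
        "targets": {
            "taggedFunctions": {
                "resourceType": "aws:lambda:function",
                "resourceTags": {tag_key: tag_value},
                "selectionMode": "ALL",
            }
        },
        "actions": {
            "addStartDelay": {
                "actionId": "aws:lambda:invocation-add-delay",
                "parameters": {
                    "duration": duration,  # ISO 8601 duration: 5 minutes
                    "invocationPercentage": "100",
                    "startupDelayMilliseconds": str(delay_ms),
                },
                "targets": {"Functions": "taggedFunctions"},
            }
        },
    }


template = build_delay_template(
    role_arn="arn:aws:iam::123456789012:role/example-fis-role",  # placeholder
    tag_key="chaos-ready",  # hypothetical tag
    tag_value="true",
)
```

With credentials configured, the dictionary could be passed to `boto3.client("fis").create_experiment_template(clientToken=..., **template)`. The second template from the demo would swap in the invocation error action with its own parameters.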
Key Takeaways and Resources
FIS provides a powerful and easy-to-use platform for implementing chaos engineering on serverless architectures
The new Lambda-specific actions (start delay, integration response modification, invocation errors) enable comprehensive resilience testing
Observability and metrics collection are critical for understanding the impact of chaos experiments
The presenters provided a GitHub repository with sample code and a link to the AWS Resilience Hub for further resources