AWS re:Invent 2025 - Resilience testing and AWS Lambda actions under the hood (COP414)

Introduction to Chaos Engineering and AWS Fault Injection Service (FIS)

  • Chaos engineering is an approach to building confidence in a system's ability to withstand failures and disruptions
  • It involves intentionally introducing faults and failures to validate system resilience
  • AWS Fault Injection Service (FIS) is a fully managed chaos engineering service that helps customers get started quickly
  • Key benefits of FIS:
    • Serverless and fully managed - no infrastructure to provision
    • Provides a library of pre-defined failure scenarios
    • Integrates natively with AWS services such as EC2, enabling actions such as pausing Auto Scaling
    • Provides controls and guardrails to introduce faults in a controlled manner

AWS Lambda and the Need for Resilience Testing

  • AWS Lambda is a serverless compute service that abstracts away infrastructure management
  • The Lambda service is designed for resilience and runs functions across multiple Availability Zones (AZs)
  • However, Lambda functions often integrate with other systems, so overall system resilience still needs to be validated
  • Chaos engineering is a recommended approach to test the resilience of Lambda-based architectures

FIS Actions for Testing Lambda Function Resilience

  • FIS recently introduced three native actions for testing Lambda function resilience:
    1. Add Start Delay: Delays the start of a Lambda function invocation to test how callers handle added latency
    2. Modify Integration Response: Returns a modified response (for example, an error status code) to test how callers handle incorrect responses
    3. Invocation Errors: Marks invocations as failed to test how callers handle function errors
  • These actions leverage the concept of Lambda extensions to inject faults into the Lambda execution lifecycle
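
To make the three fault types concrete, here is a minimal Python sketch of what each one looks like from the caller's side. It is a conceptual model only: FIS injects these faults through its Lambda extension, not through handler wrappers, and every name below (`with_start_delay`, `with_integration_response`, `with_invocation_error`, `InjectedInvocationError`, the `prevent_execution` flag) is hypothetical and chosen for illustration.

```python
import time
from typing import Any, Callable, Dict

Handler = Callable[[Dict[str, Any], Any], Dict[str, Any]]


class InjectedInvocationError(Exception):
    """Stands in for an invocation that is marked as failed."""


def with_start_delay(handler: Handler, delay_ms: int,
                     sleep: Callable[[float], None] = time.sleep) -> Handler:
    """Delay the start of each invocation, then run the handler unchanged."""
    def wrapped(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
        sleep(delay_ms / 1000.0)
        return handler(event, context)
    return wrapped


def with_integration_response(handler: Handler, status_code: int,
                              prevent_execution: bool = True) -> Handler:
    """Replace the handler's result with a response carrying a forced status code."""
    def wrapped(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
        if not prevent_execution:
            handler(event, context)  # side effects still run; the result is discarded
        return {"statusCode": status_code, "body": "injected fault"}
    return wrapped


def with_invocation_error(handler: Handler,
                          prevent_execution: bool = True) -> Handler:
    """Make every invocation fail, optionally after the handler has run."""
    def wrapped(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
        if not prevent_execution:
            handler(event, context)
        raise InjectedInvocationError("invocation marked as failed")
    return wrapped


def handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
    """A trivial handler used to exercise the wrappers."""
    return {"statusCode": 200, "body": "ok"}
```

For example, `with_start_delay(handler, 2000)` behaves like the original handler but responds two seconds later, which is the kind of behavior a caller's timeout and retry settings need to tolerate.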

How FIS Extensions Integrate with Lambda

  • FIS provides a custom extension that integrates with the Lambda runtime environment
  • The extension uses an API proxy pattern: it sits in the Lambda function invocation path, so it can act on each invocation before the function code runs
  • While an FIS experiment is running, the extension polls an S3 bucket for the active fault configuration
  • The extension then applies the configured faults during Lambda function invocations
  • The extension uses a dual-mode polling mechanism (fast and slow) to balance low polling overhead against quick recovery once a fault is removed
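
The dual-mode polling idea can be sketched in a few lines of Python. This is not the FIS extension's code. It assumes that the slow mode applies while no fault is configured and the fast mode applies while a fault is active, so that a removed fault is noticed quickly. The class name `DualModePoller`, the `fetch_config` callback (which stands in for reading the configuration from S3), and both interval values are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional

FaultConfig = Dict[str, Any]


@dataclass
class DualModePoller:
    # Hypothetical callback standing in for "read the fault configuration from S3".
    fetch_config: Callable[[], Optional[FaultConfig]]
    slow_interval_s: float = 60.0  # assumed interval while no fault is active
    fast_interval_s: float = 1.0   # assumed interval while a fault is active
    active_config: Optional[FaultConfig] = None
    next_poll_at: float = 0.0

    def maybe_poll(self, now: float) -> Optional[FaultConfig]:
        """Refresh the cached configuration if a poll is due, then return it."""
        if now >= self.next_poll_at:
            self.active_config = self.fetch_config()
            fault_active = self.active_config is not None
            interval = self.fast_interval_s if fault_active else self.slow_interval_s
            self.next_poll_at = now + interval
        return self.active_config
```

Passing the current time in as `now` keeps the sketch deterministic. A caller would invoke `maybe_poll(time.monotonic())` at the start of each invocation and apply whatever configuration comes back.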

Demonstration and Walkthrough

  • The presenters demonstrated a sample application architecture using Lambda, DynamoDB, and API Gateway
  • They walked through the AWS CDK code used to deploy the application and set up the FIS experiments
  • Two experiment templates were created:
    1. Inject a 2-second startup delay for 5 minutes on all tagged Lambda functions
    2. Inject invocation errors on all tagged Lambda functions
  • The experiments were executed, and the presenters analyzed the impact on metrics captured in a CloudWatch dashboard
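
The two experiment templates could look roughly like the Python sketch below, which only builds the template dictionaries and does not call AWS. The presenters used AWS CDK, and their code is not reproduced here. The action IDs, parameter names, and the `Functions` target key are written from memory of the FIS actions reference and should be checked against it before use. The role ARN, the `chaos-ready` tag, the names `taggedFunctions` and `injectFault`, and the 5-minute duration of the error experiment are assumptions made for illustration.

```python
import json
from typing import Any, Dict


def lambda_experiment_template(description: str, action_id: str,
                               parameters: Dict[str, str], role_arn: str,
                               tag_key: str, tag_value: str) -> Dict[str, Any]:
    """Build an FIS experiment template dict targeting all Lambda functions with a tag."""
    return {
        "description": description,
        "roleArn": role_arn,
        # No automatic stop condition in this sketch; a CloudWatch alarm is the usual guardrail.
        "stopConditions": [{"source": "none"}],
        "targets": {
            "taggedFunctions": {  # hypothetical target name
                "resourceType": "aws:lambda:function",
                "resourceTags": {tag_key: tag_value},
                "selectionMode": "ALL",
            }
        },
        "actions": {
            "injectFault": {  # hypothetical action name
                "actionId": action_id,
                "parameters": parameters,
                "targets": {"Functions": "taggedFunctions"},
            }
        },
    }


ROLE_ARN = "arn:aws:iam::123456789012:role/fis-experiment-role"  # placeholder

# Experiment 1: 2-second startup delay for 5 minutes on all tagged functions.
delay_template = lambda_experiment_template(
    "Add a 2-second start delay to tagged Lambda functions",
    "aws:lambda:invocation-add-delay",
    {"duration": "PT5M", "startupDelayMilliseconds": "2000", "invocationPercentage": "100"},
    ROLE_ARN, "chaos-ready", "true",
)

# Experiment 2: invocation errors on all tagged functions (duration is assumed).
error_template = lambda_experiment_template(
    "Inject invocation errors into tagged Lambda functions",
    "aws:lambda:invocation-error",
    {"duration": "PT5M", "invocationPercentage": "100", "preventExecution": "true"},
    ROLE_ARN, "chaos-ready", "true",
)

print(json.dumps(delay_template, indent=2))
```

With credentials and a real role in place, a dictionary like this could be passed to `boto3.client("fis").create_experiment_template(**delay_template)`. Lowering `invocationPercentage` is one way to limit the blast radius of a first run.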

Key Takeaways and Resources

  • FIS provides a fully managed way to run chaos engineering experiments against serverless architectures
  • The new Lambda-specific actions (start delay, integration response modification, invocation errors) cover added latency, incorrect responses, and failed invocations
  • Observability and metrics collection are critical for understanding the impact of chaos experiments
  • The presenters provided a GitHub repository with sample code and a link to the AWS Resilience Hub for further resources
