AWS re:Invent 2025 - Building Fault-Tolerant Systems with AWS Messaging Services (API322)
Overview
Presentation on architectural patterns and best practices for building resilient, fault-tolerant messaging systems using AWS services
Covers common challenges and failure scenarios in messaging architectures
Demonstrates how to implement patterns like Dead Letter Queues, Retry with Backoff, Circuit Breakers, and Saga Orchestration
Key Messaging Services on AWS
AWS native services: SQS, SNS, Kinesis, EventBridge
Managed open-source services: Amazon MQ (RabbitMQ, ActiveMQ), Amazon MSK (Kafka)
Most attendees were using AWS native messaging services
Resilience Challenges in Messaging Architectures
Potential failure points: Producer, Messaging Service, Consumer
Risks include message loss, message duplication, inconsistent state
Dead Letter Queue (DLQ) Pattern
Configures a secondary queue to capture messages that fail processing
Allows manual intervention and reprocessing of failed messages
Prevents message loss and "poison pill" messages from disrupting the main queue
In SQS, implemented by creating a DLQ and attaching a redrive policy (with a maxReceiveCount threshold) to the source queue
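As a rough sketch of the SQS configuration described above: a source queue points at its DLQ through the RedrivePolicy queue attribute, which names the DLQ's ARN and a maxReceiveCount. The helper name, the queue names, the ARN, and the threshold of 3 are illustrative assumptions, not values from the talk.

```python
import json


def redrive_attributes(dlq_arn: str, max_receive_count: int = 5) -> dict:
    """Build the SQS queue attributes that attach a DLQ to a source queue.

    After a message has been received max_receive_count times without being
    deleted, SQS moves it to the queue identified by dlq_arn.
    """
    if max_receive_count < 1:
        raise ValueError("max_receive_count must be at least 1")
    policy = {
        "deadLetterTargetArn": dlq_arn,
        "maxReceiveCount": str(max_receive_count),
    }
    # SQS expects the policy as a JSON string, not a nested object.
    return {"RedrivePolicy": json.dumps(policy)}


# Hypothetical ARN, for illustration only.
attrs = redrive_attributes("arn:aws:sqs:us-east-1:123456789012:orders-dlq", 3)

# With boto3 installed and credentials configured, this could be applied as:
#   boto3.client("sqs").create_queue(QueueName="orders", Attributes=attrs)
```

A low maxReceiveCount moves poison-pill messages aside quickly, while a higher one gives transient errors more chances to clear before a message lands in the DLQ.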
Retry with Exponential Backoff
Handles transient errors by retrying messages with increasing delay
Gives downstream systems time to recover and avoids overloading
Helps prevent cascading failures and "message storms"
Many AWS services (e.g. SNS, Lambda) have this built in
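Where a service does not retry on the caller's behalf, the idea can be sketched in a few lines. The version below assumes the "full jitter" variant (a random wait between zero and the exponential bound). The function names and default values are illustrative, not taken from the talk.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Upper bound on the wait before retry number `attempt` (0-based)."""
    return min(cap, base * (2 ** attempt))


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base: float = 0.5,
    cap: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[float, float], float] = random.uniform,
) -> T:
    """Call `operation` until it succeeds or max_attempts is exhausted."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of retries: surface the last error
            # Full jitter: wait a random time in [0, exponential bound] so
            # many clients do not retry in lockstep.
            sleep(rng(0.0, backoff_delay(attempt, base, cap)))
    raise ValueError("max_attempts must be at least 1")
```

The `sleep` and `rng` parameters exist only so the behaviour can be exercised without real waiting. In production code a narrower exception type than `Exception` would usually be caught, so that only errors known to be transient are retried.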
Circuit Breaker Pattern
Detects persistent failures in downstream dependencies
Opens the "circuit" to prevent repeated failed attempts
Allows time for the failing system to recover
Implemented using a state machine (e.g. AWS Step Functions) to monitor and control the circuit state
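The talk's implementation uses a Step Functions state machine. The sketch below is not that implementation but an in-process illustration of the same closed / open / half-open state logic. The class name, threshold, and timeout values are assumptions made for the example.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, recovery_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means the circuit is closed

    @property
    def state(self) -> str:
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.recovery_timeout:
            return "half-open"  # timeout elapsed: let a trial call through
        return "open"

    def call(self, operation: Callable[[], T]) -> T:
        if self.state == "open":
            raise CircuitOpenError("downstream dependency is marked unhealthy")
        try:
            result = operation()
        except Exception:
            self.failures += 1
            # A failed trial call, or too many consecutive failures,
            # (re)opens the circuit and restarts the recovery timer.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers fail fast with `CircuitOpenError` instead of sending more traffic to the struggling dependency. This sketch is single-threaded, so a concurrent consumer would need locking around the state changes.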
Saga Orchestration
Handles distributed transactions across multiple consumers
Aims for eventual consistency of the overall process: each step either completes or is undone by a compensating action
Uses a central coordinator (e.g. Step Functions) to manage the compensation logic
Enables self-healing by automatically retrying or rolling back failed steps
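The coordinator's compensation logic can be sketched in plain Python, standing in for what a Step Functions workflow would orchestrate. The step names below (reserve_stock, charge_payment, brew_coffee) are hypothetical and chosen only to echo the coffee-ordering demo.

```python
from typing import Callable, List, Tuple

# Each step pairs an action with the compensation that undoes it.
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse.

    Returns True if every step succeeded, False if the saga was rolled back.
    """
    completed: List[Step] = []
    for step in steps:
        _name, action, _compensate = step
        try:
            action()
        except Exception:
            for _done_name, _done_action, compensate in reversed(completed):
                compensate()  # a real system must also handle failures here
            return False
        completed.append(step)
    return True


log: List[str] = []


def fail_payment() -> None:
    raise RuntimeError("payment declined")


ok = run_saga([
    ("reserve_stock", lambda: log.append("reserve"), lambda: log.append("release")),
    ("charge_payment", fail_payment, lambda: log.append("refund")),
    ("brew_coffee", lambda: log.append("brew"), lambda: log.append("discard")),
])
# ok is False and log is ["reserve", "release"]: the reservation was undone,
# and the steps that never completed were neither run nor compensated.
```

Only completed steps are compensated, and in reverse order, which is what lets the overall process end in a consistent state without a distributed lock or two-phase commit.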
Redundancy and High Availability
Multi-region, multi-AZ configurations for messaging services
Active-standby broker pairs (e.g. Amazon MQ) for failover
Aims to minimize downtime and keep message processing running during failover
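On the client side, failover to a standby broker amounts to trying each endpoint in turn. ActiveMQ clients typically get this from the failover transport in the connection URL. The sketch below shows the same idea in isolation, with a hypothetical `connect` callable and made-up endpoint names.

```python
from typing import Callable, Optional, Sequence, TypeVar

C = TypeVar("C")


def connect_with_failover(endpoints: Sequence[str],
                          connect: Callable[[str], C]) -> C:
    """Return a connection to the first endpoint that accepts one."""
    last_error: Optional[Exception] = None
    for endpoint in endpoints:
        try:
            return connect(endpoint)
        except ConnectionError as exc:
            last_error = exc  # this broker is unavailable; try the next one
    raise ConnectionError("no broker endpoint reachable") from last_error
```

Listing the active endpoint first keeps the common case fast. After a failover, the standby is simply the first endpoint that accepts the connection.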
Key Takeaways
Design for failure - build systems that can gracefully handle and recover from failures
Leverage AWS messaging services and patterns to increase resilience and fault tolerance
Implement DLQs, retries, circuit breakers, and saga orchestration to handle different failure scenarios
Monitor and observe system behavior to optimize for cost and performance
Test failure scenarios in non-production environments to validate resilience
Technical Details
Demonstrated a coffee ordering system built on AWS services (API Gateway, SQS, Lambda)
Injected faults to simulate payment system failures and observed system behavior
Implemented DLQ, circuit breaker, and automated redrive functionality
Monitored metrics in CloudWatch to analyze cost and performance impact
Business Impact
Increased reliability and availability of mission-critical messaging systems
Reduced operational overhead and on-call incidents due to messaging failures
Enabled self-healing and automatic recovery, improving customer experience
Optimized costs by preventing "message storms" and unnecessary compute usage
Examples
Coffee ordering system experienced payment system failures, leading to messages being retried indefinitely
DLQ pattern captured failed messages, allowing for manual intervention and reprocessing
Circuit breaker pattern detected persistent failures and prevented further attempts, avoiding resource exhaustion
Automated redrive functionality detected when the payment system was healthy again and reprocessed the backlog
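The automated redrive step can be illustrated with in-memory queues: messages leave the DLQ only once a health check on the downstream system passes. This is a simplified sketch, not the demo's code. The function name, the health-check callable, and the batch size are assumptions. On real SQS queues, a comparable move could be started with the StartMessageMoveTask API.

```python
from collections import deque
from typing import Callable, Deque


def redrive_if_healthy(dlq: Deque[str], main_queue: Deque[str],
                       is_healthy: Callable[[], bool],
                       batch_size: int = 10) -> int:
    """Move up to batch_size messages from the DLQ back to the main queue.

    Nothing is moved while the downstream health check fails. Returns the
    number of messages moved.
    """
    if not is_healthy():
        return 0
    moved = 0
    while dlq and moved < batch_size:
        main_queue.append(dlq.popleft())  # oldest first, preserving order
        moved += 1
    return moved
```

Redriving in bounded batches lets the recovered payment system absorb the backlog gradually instead of receiving every failed order at once.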