AWS re:Invent 2025 - Building Fault-Tolerant Systems with AWS Messaging Services (API322)

Building Fault-Tolerant Systems with AWS Messaging Services

Overview

Presentation on architectural patterns and best practices for building resilient, fault-tolerant messaging systems using AWS services
Covers common challenges and failure scenarios in messaging architectures
Demonstrates how to implement patterns like Dead Letter Queues, Retry with Backoff, Circuit Breakers, and Saga Orchestration

Key Messaging Services on AWS

AWS native services: SQS, SNS, Kinesis, EventBridge
Managed open-source services: Amazon MQ (RabbitMQ, ActiveMQ), Amazon MSK (Kafka)
Most attendees were using AWS native messaging services

Resilience Challenges in Messaging Architectures

Potential failure points: Producer, Messaging Service, Consumer
Risks include message loss, message duplication, inconsistent state

Dead Letter Queue (DLQ) Pattern

Configures a secondary queue to capture messages that fail processing
Allows manual intervention and reprocessing of failed messages
Prevents message loss and "poison pill" messages from disrupting the main queue
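In SQS, implemented by creating a second queue as the DLQ and attaching a redrive policy to the source queue (the DLQ's ARN plus a maximum receive count before a message is moved)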
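
As a rough illustration of the SQS configuration described above, the sketch below builds the `RedrivePolicy` queue attribute. The helper name `build_redrive_policy`, the example ARN, and the receive count of 3 are assumptions made for this example and do not come from the talk.

```python
import json


def build_redrive_policy(dlq_arn: str, max_receive_count: int = 5) -> dict:
    """Return queue attributes that attach a DLQ to a source queue.

    Hypothetical helper: after `max_receive_count` failed receives,
    SQS moves the message to the queue identified by `dlq_arn`.
    """
    if max_receive_count < 1:
        raise ValueError("max_receive_count must be at least 1")
    policy = {
        "deadLetterTargetArn": dlq_arn,
        "maxReceiveCount": max_receive_count,
    }
    # SQS expects the RedrivePolicy attribute as a JSON-encoded string.
    return {"RedrivePolicy": json.dumps(policy)}


# Placeholder ARN, for illustration only.
attributes = build_redrive_policy(
    "arn:aws:sqs:us-east-1:123456789012:orders-dlq", max_receive_count=3
)
print(attributes["RedrivePolicy"])
```

With boto3, these attributes could then be applied to the source queue through `set_queue_attributes(QueueUrl=..., Attributes=attributes)`. A low receive count moves poison-pill messages out of the way quickly, while a higher one gives transient errors more chances to clear.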

Retry with Exponential Backoff

Handles transient errors by retrying messages with increasing delay
Gives downstream systems time to recover and avoids overloading
Helps prevent cascading failures and "message storms"
Many AWS services (e.g. SNS, Lambda) have retry behavior built in
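
For code that calls a downstream system directly, the retry idea above could be sketched as follows. This is a minimal example, not the implementation shown in the talk: the function names, the default base delay of 0.5 seconds, the 30 second cap, and the use of random jitter are all assumptions chosen for illustration.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Upper bound of the wait before retry number `attempt` (0-based)."""
    return min(cap, base * (2 ** attempt))


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base: float = 0.5,
    cap: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation`, retrying failures with exponentially growing waits."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Jitter: wait a random time up to the exponential bound so
            # that many clients do not retry in lockstep.
            sleep(random.uniform(0, backoff_delay(attempt, base, cap)))
    raise AssertionError("unreachable")
```

The `sleep` parameter is injectable so the function can be exercised in tests without real waiting. Once the attempts are exhausted, the error propagates, which is the point where a DLQ would normally take over.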

Circuit Breaker Pattern

Detects persistent failures in downstream dependencies
Opens the "circuit" to prevent repeated failed attempts
Allows time for the failing system to recover
Implemented using a state machine (e.g. AWS Step Functions) to monitor and control the circuit state
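
The talk describes a state machine in Step Functions. The sketch below is a simpler in-process variant, included only to make the three circuit states concrete. The class name, the thresholds, and the injectable `clock` are assumptions for this example.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        recovery_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means closed

    @property
    def state(self) -> str:
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.recovery_timeout:
            return "half_open"  # allow a trial call through
        return "open"

    def call(self, operation: Callable[[], T]) -> T:
        if self.state == "open":
            raise CircuitOpenError("downstream dependency marked unhealthy")
        try:
            result = operation()
        except Exception:
            self.failures += 1
            # A failed trial call, or too many consecutive failures,
            # opens (or re-opens) the circuit and restarts the timer.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers fail fast with `CircuitOpenError` instead of sending more requests to the struggling dependency. After `recovery_timeout` seconds, one trial call decides whether the circuit closes again.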

Saga Orchestration

Handles distributed transactions across multiple consumers
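Keeps the overall process consistent by undoing completed steps with compensating actions when a later step fails, rather than relying on a single atomic transaction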
Uses a central coordinator (e.g. Step Functions) to manage the compensation logic
Enables self-healing by automatically retrying or rolling back failed steps
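
The coordinator role could be sketched in plain Python as below. This is an illustrative stand-in for a Step Functions workflow, not the talk's implementation. The `run_saga` function, the step tuple layout, and the context keys `failed_step` and `error` are assumptions for this example.

```python
from typing import Callable, List, Tuple

# Each step is (name, action, compensation); both callables receive a
# shared context dict. This layout is an assumption for the sketch.
Step = Tuple[str, Callable[[dict], None], Callable[[dict], None]]


def run_saga(steps: List[Step], context: dict) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: List[Step] = []
    for step in steps:
        name, action, _compensate = step
        try:
            action(context)
            completed.append(step)
        except Exception as exc:
            context["failed_step"] = name
            context["error"] = str(exc)
            # Roll back only the steps that actually finished, newest first.
            for _name, _action, compensate in reversed(completed):
                compensate(context)
            return False
    return True


def charge_payment(ctx: dict) -> None:
    raise RuntimeError("payment declined")  # simulated failure


log: List[str] = []
order_steps: List[Step] = [
    ("reserve_cup", lambda c: log.append("reserve"), lambda c: log.append("release")),
    ("charge_payment", charge_payment, lambda c: log.append("refund")),
    ("brew", lambda c: log.append("brew"), lambda c: log.append("discard")),
]
succeeded = run_saga(order_steps, {})
print(succeeded, log)
```

In this run the payment step fails, so only the cup reservation is compensated and the brew step never starts. A production coordinator would also need to persist its progress and handle compensations that fail themselves, which this sketch leaves out.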

Redundancy and High Availability

Multi-region, multi-AZ configurations for messaging services
Active-standby broker pairs (e.g. Amazon MQ for ActiveMQ) for failover
Aims to minimize downtime and keep message processing running when a broker, AZ, or region fails

Key Takeaways

Design for failure - build systems that can gracefully handle and recover from failures
Leverage AWS messaging services and patterns to increase resilience and fault tolerance
Implement DLQs, retries, circuit breakers, and saga orchestration to handle different failure scenarios
Monitor and observe system behavior to optimize for cost and performance
Test failure scenarios in non-production environments to validate resilience

Technical Details

Demonstrated a coffee ordering system built on AWS services (API Gateway, SQS, Lambda)
Injected faults to simulate payment system failures and observed system behavior
Implemented DLQ, circuit breaker, and automated redrive functionality
Monitored metrics in CloudWatch to analyze cost and performance impact

Business Impact

Increased reliability and availability of mission-critical messaging systems
Reduced operational overhead and on-call incidents due to messaging failures
Enabled self-healing and automatic recovery, improving customer experience
Optimized costs by preventing "message storms" and unnecessary compute usage

Examples

Coffee ordering system experienced payment system failures, leading to messages being retried indefinitely
DLQ pattern captured failed messages, allowing for manual intervention and reprocessing
Circuit breaker pattern detected persistent failures and prevented further attempts, avoiding resource exhaustion
Automated redrive functionality detected when the payment system was healthy again and reprocessed the backlog
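
The automated redrive step in the last example can be modeled in memory as follows. This sketch uses Python deques in place of real queues, and the function name, the `is_healthy` probe, and the batch size are assumptions for illustration. The talk's demo details are not reproduced here.

```python
from collections import deque
from typing import Callable, Deque


def redrive_if_healthy(
    dlq: Deque[str],
    main_queue: Deque[str],
    is_healthy: Callable[[], bool],
    batch_size: int = 10,
) -> int:
    """Move up to `batch_size` messages from the DLQ back to the main queue.

    Nothing is moved while the health probe reports the dependency as down.
    Returns the number of messages moved.
    """
    if not is_healthy():
        return 0
    moved = 0
    while dlq and moved < batch_size:
        # Preserve the original order: oldest failed message goes first.
        main_queue.append(dlq.popleft())
        moved += 1
    return moved


dlq = deque(["order-1", "order-2", "order-3"])
main_queue: Deque[str] = deque()
print(redrive_if_healthy(dlq, main_queue, is_healthy=lambda: False))
print(redrive_if_healthy(dlq, main_queue, is_healthy=lambda: True, batch_size=2))
```

Moving the backlog in limited batches avoids flooding a payment system that has only just recovered. SQS also provides a dead-letter queue redrive capability that could take the place of the manual loop.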
