AWS re:Invent 2025 - Building Fault-Tolerant Systems with AWS Messaging Services (API322)

Building Fault-Tolerant Systems with AWS Messaging Services

Overview

  • Presentation on architectural patterns and best practices for building resilient, fault-tolerant messaging systems using AWS services
  • Covers common challenges and failure scenarios in messaging architectures
  • Demonstrates how to implement patterns like Dead Letter Queues, Retry with Backoff, Circuit Breakers, and Saga Orchestration

Key Messaging Services on AWS

  • AWS native services: SQS, SNS, Kinesis, EventBridge
  • Managed open-source services: Amazon MQ (RabbitMQ, ActiveMQ), Amazon MSK (Kafka)
  • Most attendees were using AWS native messaging services

Resilience Challenges in Messaging Architectures

  • Potential failure points: Producer, Messaging Service, Consumer
  • Risks include message loss, message duplication, inconsistent state

Dead Letter Queue (DLQ) Pattern

  • Configures a secondary queue to capture messages that fail processing
  • Allows manual intervention and reprocessing of failed messages
  • Prevents message loss and "poison pill" messages from disrupting the main queue
  • In SQS, implemented by creating a DLQ and attaching a redrive policy to the source queue that sets how many times a message may be received before it is moved to the DLQ
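
As a minimal sketch of the SQS configuration described above, the helper below builds the RedrivePolicy queue attribute that links a source queue to its DLQ. The function name, the queue ARN, and the receive count of 3 are illustrative assumptions, not values from the talk.

```python
import json


def dlq_redrive_attributes(dlq_arn: str, max_receive_count: int = 5) -> dict:
    """Build the SQS queue attributes that attach a DLQ to a source queue.

    Hypothetical helper: the name and default value are assumptions.
    """
    if max_receive_count < 1:
        raise ValueError("max_receive_count must be at least 1")
    policy = {
        # ARN of the queue that receives messages after repeated failures.
        "deadLetterTargetArn": dlq_arn,
        # Number of receives allowed before SQS moves the message to the DLQ.
        "maxReceiveCount": str(max_receive_count),
    }
    # SQS expects the policy as a JSON string inside the attributes map.
    return {"RedrivePolicy": json.dumps(policy)}


# Placeholder ARN for illustration only.
attrs = dlq_redrive_attributes("arn:aws:sqs:us-east-1:123456789012:orders-dlq", 3)
```

With boto3, a dict like this could be passed as the Attributes argument of set_queue_attributes on the source queue. A low receive count moves poison-pill messages aside quickly, while a higher one tolerates more transient failures before giving up.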

Retry with Exponential Backoff

  • Handles transient errors by retrying messages with increasing delay
  • Gives downstream systems time to recover and avoids overloading
  • Helps prevent cascading failures and "message storms"
  • Many AWS services (e.g. SNS, Lambda) have retry behavior of this kind built in
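
Where retries have to be written by hand, the idea above can be sketched roughly as follows. This is an illustrative in-process version, not the code shown in the talk: TransientError, the delay values, and the use of random jitter are all assumptions.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Largest delay allowed before retry number `attempt` (0-based)."""
    return min(cap, base * (2 ** attempt))


def retry_with_backoff(operation, max_attempts=5, base=0.5, cap=30.0,
                       sleep=time.sleep, rng=random.random):
    """Call `operation`, retrying TransientError with growing, jittered delays."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller (or a DLQ) handle it
            # Jitter: sleep a random fraction of the exponential ceiling so
            # that many clients do not retry in lockstep.
            sleep(rng() * backoff_delay(attempt, base, cap))
```

The sleep and rng parameters exist only so the behavior can be exercised without waiting on real delays. The cap keeps the delay bounded, and the final re-raise is what lets a message eventually fall through to a DLQ instead of retrying forever.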

Circuit Breaker Pattern

  • Detects persistent failures in downstream dependencies
  • Opens the "circuit" to prevent repeated failed attempts
  • Allows time for the failing system to recover
  • Implemented using a state machine (e.g. AWS Step Functions) to monitor and control the circuit state
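
The state transitions behind this pattern can be sketched in a few lines. The talk describes a Step Functions state machine; the version below is instead an in-process illustration, with the class name, thresholds, and state labels chosen as assumptions.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    """Illustrative breaker with CLOSED, OPEN and HALF_OPEN states."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.state = "CLOSED"
        self.failures = 0
        self.opened_at = 0.0

    def call(self, operation):
        if self.state == "OPEN":
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast: give the downstream system time to recover.
                raise CircuitOpenError("circuit is open; call skipped")
            # Timeout elapsed: allow a single trial call through.
            self.state = "HALF_OPEN"
        try:
            result = operation()
        except Exception:
            self._record_failure()
            raise
        # Any success closes the circuit and resets the failure count.
        self.state = "CLOSED"
        self.failures = 0
        return result

    def _record_failure(self):
        self.failures += 1
        if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
            self.state = "OPEN"
            self.opened_at = self.clock()
```

A consumer would wrap each downstream call in breaker.call(...) and treat CircuitOpenError as a signal to leave the message on the queue rather than burn another attempt.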

Saga Orchestration

  • Handles distributed transactions across multiple consumers
  • Keeps the overall process consistent: each step either completes or is undone by a compensating action
  • Uses a central coordinator (e.g. Step Functions) to manage the compensation logic
  • Enables self-healing by automatically retrying or rolling back failed steps
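
The coordinator's compensation logic can be sketched as below. This is a simplified in-process stand-in for an orchestrator such as Step Functions; the run_saga function, the step tuple layout, and the log format are assumptions made for illustration.

```python
from typing import Callable, List, Tuple

# Each step is (name, action, compensation); the layout is an assumption.
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step]) -> Tuple[bool, List[str]]:
    """Run steps in order; on failure, undo completed steps in reverse."""
    completed: List[Tuple[str, Callable[[], None]]] = []
    log: List[str] = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception:
            log.append(f"failed:{name}")
            # Roll back only what actually completed, most recent first.
            for done_name, undo in reversed(completed):
                undo()
                log.append(f"compensated:{done_name}")
            return False, log
        log.append(f"done:{name}")
        completed.append((name, undo_fn := compensate))
    return True, log
```

For a coffee order, the steps might be reserving stock, charging payment, and notifying the barista; if the charge fails, only the stock reservation is compensated. A production coordinator would also need to persist its progress and handle compensations that fail, which this sketch leaves out.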

Redundancy and High Availability

  • Multi-region, multi-AZ configurations for messaging services
  • Active-standby broker pairs (e.g. Amazon MQ) for failover
  • Aims to keep message processing running with minimal downtime when a broker, AZ, or Region fails

Key Takeaways

  1. Design for failure - build systems that can gracefully handle and recover from failures
  2. Leverage AWS messaging services and patterns to increase resilience and fault tolerance
  3. Implement DLQs, retries, circuit breakers, and saga orchestration to handle different failure scenarios
  4. Monitor and observe system behavior to optimize for cost and performance
  5. Test failure scenarios in non-production environments to validate resilience

Technical Details

  • Demonstrated a coffee ordering system built on AWS services (API Gateway, SQS, Lambda)
  • Injected faults to simulate payment system failures and observed system behavior
  • Implemented DLQ, circuit breaker, and automated redrive functionality
  • Monitored metrics in CloudWatch to analyze cost and performance impact

Business Impact

  • Increased reliability and availability of mission-critical messaging systems
  • Reduced operational overhead and on-call incidents due to messaging failures
  • Enabled self-healing and automatic recovery, improving customer experience
  • Optimized costs by preventing "message storms" and unnecessary compute usage

Examples

  • The coffee ordering system experienced payment system failures, which caused messages to be retried indefinitely
  • DLQ pattern captured failed messages, allowing for manual intervention and reprocessing
  • Circuit breaker pattern detected persistent failures and prevented further attempts, avoiding resource exhaustion
  • Automated redrive functionality detected when the payment system was healthy again and reprocessed the backlog
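
The automated redrive step can be illustrated with an in-memory model. This is only a sketch of the control flow: the queues are plain deques rather than SQS queues, and redrive_if_healthy, the health check, and the batch size are hypothetical.

```python
from collections import deque
from typing import Callable, Deque


def redrive_if_healthy(dlq: Deque[str], main_queue: Deque[str],
                       is_healthy: Callable[[], bool], batch_size: int = 10) -> int:
    """Move up to `batch_size` messages from the DLQ back to the main queue.

    Nothing is moved while the downstream dependency is unhealthy, so failed
    messages are not replayed into a system that is still down.
    """
    if not is_healthy():
        return 0
    moved = 0
    while dlq and moved < batch_size:
        # Oldest failed message is replayed first.
        main_queue.append(dlq.popleft())
        moved += 1
    return moved
```

Limiting each pass to a batch keeps the recovering payment system from being hit with the whole backlog at once. With real SQS queues, the move itself could be handled by the StartMessageMoveTask API rather than by copying messages in application code.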
