AWS re:Invent 2025 - The Art of Embracing Failures in Serverless Architectures (DEV312)
Embracing Failures in Serverless Architectures
Understanding the Complexity of Distributed Systems
Distributed systems are composed of interconnected machines that can fail independently in non-deterministic ways.
Cloud providers abstract away the complexity of distributed systems, but the underlying failures still exist.
Serverless architectures further hide the infrastructure, creating a false sense of security around failures.
Distributed applications mirror the complexity of the underlying distributed systems, inheriting the same trade-offs and failure modes.
The Hidden Superpowers: Timeouts and Retries
Timeouts and retries are essential tools for building resilient distributed applications, but they must be used carefully.
Timeouts prevent applications from waiting indefinitely for a response, but choosing the right values is crucial: a timeout that is too long ties up resources on a call that will never succeed, while one that is too short aborts requests that would have completed.
Retries can help recover from transient failures, but they can also amplify and spread failures if not properly configured.
AWS SDKs provide built-in retry mechanisms, but the default settings may not be optimal and can lead to cascading failures.
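As an illustration, those defaults can be overridden when the client is created rather than accepted blindly. The sketch below assumes boto3/botocore; the timeout and retry values are illustrative, not recommendations:

```python
# A minimal sketch of overriding boto3's default timeouts and retry
# behavior. All numeric values here are illustrative placeholders.
import boto3
from botocore.config import Config

config = Config(
    connect_timeout=2,  # seconds to wait to establish a connection
    read_timeout=5,     # seconds to wait for a response on that connection
    retries={
        "max_attempts": 3,   # total attempts, including the initial call
        "mode": "adaptive",  # adds client-side rate limiting on throttling
    },
)

dynamodb = boto3.client("dynamodb", config=config)
```

The same Config object can be shared across clients, so timeout and retry policy is decided once per application rather than per call site.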
Avoiding the Pitfalls of Timeouts and Retries
Configuring timeouts and retries is essential to prevent resource exhaustion and system outages.
Retrying too aggressively or without proper backoff can overload the underlying systems, leading to a "match made in hell" scenario.
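A common remedy is capped exponential backoff with jitter, so that clients spread their retries out instead of hammering a struggling service in synchronized waves. A minimal sketch of the "full jitter" variant, with illustrative base and cap values:

```python
# Capped exponential backoff with "full jitter": each retry sleeps a
# random duration between 0 and an exponentially growing, capped ceiling.
import random

def backoff_delay(attempt: int, base: float = 0.1, cap: float = 5.0) -> float:
    """Return the sleep in seconds before retry number `attempt` (0-based)."""
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0, ceiling)
```

The randomization is the important part: without jitter, a fleet of clients that failed together will all retry at the same instants and overload the service again.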
Partial failures in batch operations, such as Kinesis PutRecords calls, mean some records in a request can succeed while others fail; the response must be inspected and the failed records retried to avoid silent data loss.
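For example, PutRecords reports per-record results: the call can return successfully while FailedRecordCount is non-zero. A hypothetical retry helper (the backoff values and attempt limit are illustrative, and `client` is assumed to be a boto3 Kinesis client) might look like:

```python
# Retry only the failed entries of a Kinesis PutRecords call.
# The results list in the response is positionally aligned with the
# request records; failed entries carry an "ErrorCode" field.
import time

def put_records_with_retry(client, stream_name, records, max_attempts=3):
    pending = records
    for attempt in range(max_attempts):
        response = client.put_records(StreamName=stream_name, Records=pending)
        if response["FailedRecordCount"] == 0:
            return []  # everything was accepted
        # Keep only the records whose paired result reports an error.
        pending = [
            record
            for record, result in zip(pending, response["Records"])
            if "ErrorCode" in result
        ]
        time.sleep(0.1 * (2 ** attempt))  # simple exponential backoff
    return pending  # records still failing after all attempts
```

Resubmitting the whole batch instead of only the failed entries would duplicate the records that already succeeded, so filtering on the per-record results matters as much as the retry itself.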
Blindly relying on default settings can be dangerous and lead to unexpected consequences.
Handling Service Limits and Throttling
Serverless services have inherent limits and throttling mechanisms to prevent resource monopolization.
Exceeding these limits can lead to partial failures and data loss, which must be properly handled.
Lambda event source mappings provide built-in error-handling controls, such as maximum retry attempts, maximum record age, and on-failure destinations, that can be leveraged to limit the blast radius of failures.
Configuring these settings appropriately, including retry limits and dead-letter queues, is crucial to maintaining system resilience.
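One such control is partial batch responses: with ReportBatchItemFailures enabled on a stream or queue event source mapping, the function reports exactly which records failed, so only those are retried rather than the whole batch. A sketch of a Kinesis-triggered handler, where `process` stands in for hypothetical business logic:

```python
# Lambda handler using partial batch responses for a Kinesis event
# source: failed records are reported by sequence number, so Lambda
# retries only those instead of the entire batch.
import base64

def process(payload: bytes) -> None:
    # Hypothetical business logic; raises to signal that a record failed.
    if payload == b"bad":
        raise ValueError("cannot process record")

def handler(event, context):
    failures = []
    for record in event["Records"]:
        try:
            process(base64.b64decode(record["kinesis"]["data"]))
        except Exception:
            # Reporting the sequence number makes Lambda retry only this record.
            failures.append({"itemIdentifier": record["kinesis"]["sequenceNumber"]})
    return {"batchItemFailures": failures}
```

Returning an empty batchItemFailures list tells Lambda the whole batch succeeded; without this setting, one poison record would cause the entire batch to be retried.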
Embracing Failures and Building Resilience
Distributed systems and architectures are inherently complex, and failures are inevitable.
Instead of avoiding failures, the key is to embrace them and build systems that can gracefully handle and recover from failures.
Understanding the fundamental concepts of distributed systems, such as timeouts, retries, and service limits, is essential to making informed decisions.
Blindly relying on defaults can lead to disastrous consequences, so taking control and configuring these parameters is crucial.
Failures are opportunities to learn and improve the resilience of the system, rather than something to be feared.
Key Takeaways
Distributed systems and serverless architectures are inherently complex and prone to failures.
Timeouts and retries are powerful tools for building resilience, but they must be used carefully to avoid amplifying failures.
Configuring appropriate timeout values and retry strategies is essential to prevent resource exhaustion and system outages.
Handling partial failures, service limits, and throttling is crucial to maintain data integrity and system availability.
Embracing failures and building systems that can gracefully recover is the key to developing resilient distributed applications.