AWS re:Invent 2025 - The Art of Embracing Failures in Serverless Architectures (DEV312)
Embracing Failures in Serverless Architectures
Understanding the Complexity of Distributed Systems
Distributed systems are composed of interconnected machines that can fail independently in non-deterministic ways.
Cloud providers abstract away the complexity of distributed systems, but the underlying failures still exist.
Serverless architectures further hide the infrastructure, creating a false sense of security around failures.
Distributed applications mirror the complexity of the underlying distributed systems, inheriting the same trade-offs and failure modes.
The Hidden Superpowers: Timeouts and Retries
Timeouts and retries are essential tools for building resilient distributed applications, but they must be used carefully.
Timeouts prevent applications from waiting indefinitely for a response, but choosing the right values is crucial: a timeout that is too long ties up resources on a call that will never succeed, while one that is too short aborts requests that would have completed.
Retries can help recover from transient failures, but they can also amplify and spread failures if not properly configured.
AWS SDKs provide built-in retry mechanisms, but the default settings may not be optimal and can lead to cascading failures.
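As an illustration, those defaults can be overridden when the client is created rather than accepted blindly. The sketch below assumes boto3/botocore; the timeout and retry values are illustrative, not recommendations:

```python
# A minimal sketch of overriding boto3's default timeouts and retry
# behavior. All numeric values here are illustrative placeholders.
import boto3
from botocore.config import Config

config = Config(
    connect_timeout=2,  # seconds to wait to establish a connection
    read_timeout=5,     # seconds to wait for a response on that connection
    retries={
        "max_attempts": 3,   # total attempts, including the initial call
        "mode": "adaptive",  # adds client-side rate limiting on throttling
    },
)

dynamodb = boto3.client("dynamodb", config=config)
```

The same Config object can be shared across clients, so timeout and retry policy is decided once per application rather than per call site.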
Avoiding the Pitfalls of Timeouts and Retries
Configuring timeouts and retries is essential to prevent resource exhaustion and system outages.
Retrying too aggressively or without proper backoff can overload the underlying systems, leading to a "match made in hell" scenario.
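A common remedy is capped exponential backoff with jitter, so that clients spread their retries out instead of hammering a struggling service in synchronized waves. A minimal sketch of the "full jitter" variant, with illustrative base and cap values:

```python
# Capped exponential backoff with "full jitter": each retry sleeps a
# random duration between 0 and an exponentially growing, capped ceiling.
import random

def backoff_delay(attempt: int, base: float = 0.1, cap: float = 5.0) -> float:
    """Return the sleep in seconds before retry number `attempt` (0-based)."""
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0, ceiling)
```

The randomization is the important part: without jitter, a fleet of clients that failed together will all retry at the same instants and overload the service again.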
Partial failures in batch operations, such as Kinesis PutRecords calls, mean some records in a request can succeed while others fail; the response must be inspected and the failed records retried to avoid silent data loss.
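For example, PutRecords reports per-record results: the call can return successfully while FailedRecordCount is non-zero. A hypothetical retry helper (the backoff values and attempt limit are illustrative, and `client` is assumed to be a boto3 Kinesis client) might look like:

```python
# Retry only the failed entries of a Kinesis PutRecords call.
# The results list in the response is positionally aligned with the
# request records; failed entries carry an "ErrorCode" field.
import time

def put_records_with_retry(client, stream_name, records, max_attempts=3):
    pending = records
    for attempt in range(max_attempts):
        response = client.put_records(StreamName=stream_name, Records=pending)
        if response["FailedRecordCount"] == 0:
            return []  # everything was accepted
        # Keep only the records whose paired result reports an error.
        pending = [
            record
            for record, result in zip(pending, response["Records"])
            if "ErrorCode" in result
        ]
        time.sleep(0.1 * (2 ** attempt))  # simple exponential backoff
    return pending  # records still failing after all attempts
```

Resubmitting the whole batch instead of only the failed entries would duplicate the records that already succeeded, so filtering on the per-record results matters as much as the retry itself.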
Blindly relying on default settings can be dangerous and lead to unexpected consequences.
Handling Service Limits and Throttling
Serverless services have inherent limits and throttling mechanisms to prevent resource monopolization.
Exceeding these limits can lead to partial failures and data loss, which must be properly handled.
Lambda event source mappings provide built-in error-handling controls, such as maximum retry attempts, maximum record age, and on-failure destinations, that can be leveraged to limit the blast radius of failures.
Configuring these settings appropriately, including retry limits and dead-letter queues, is crucial to maintaining system resilience.
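One such control is partial batch responses: with ReportBatchItemFailures enabled on a stream or queue event source mapping, the function reports exactly which records failed, so only those are retried rather than the whole batch. A sketch of a Kinesis-triggered handler, where `process` stands in for hypothetical business logic:

```python
# Lambda handler using partial batch responses for a Kinesis event
# source: failed records are reported by sequence number, so Lambda
# retries only those instead of the entire batch.
import base64

def process(payload: bytes) -> None:
    # Hypothetical business logic; raises to signal that a record failed.
    if payload == b"bad":
        raise ValueError("cannot process record")

def handler(event, context):
    failures = []
    for record in event["Records"]:
        try:
            process(base64.b64decode(record["kinesis"]["data"]))
        except Exception:
            # Reporting the sequence number makes Lambda retry only this record.
            failures.append({"itemIdentifier": record["kinesis"]["sequenceNumber"]})
    return {"batchItemFailures": failures}
```

Returning an empty batchItemFailures list tells Lambda the whole batch succeeded; without this setting, one poison record would cause the entire batch to be retried.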
Embracing Failures and Building Resilience
Distributed systems and architectures are inherently complex, and failures are inevitable.
Instead of avoiding failures, the key is to embrace them and build systems that can gracefully handle and recover from failures.
Understanding the fundamental concepts of distributed systems, such as timeouts, retries, and service limits, is essential to making informed decisions.
Blindly relying on defaults can lead to disastrous consequences, so taking control and configuring these parameters is crucial.
Failures are opportunities to learn and improve the resilience of the system, rather than something to be feared.
Key Takeaways
Distributed systems and serverless architectures are inherently complex and prone to failures.
Timeouts and retries are powerful tools for building resilience, but they must be used carefully to avoid amplifying failures.
Configuring appropriate timeout values and retry strategies is essential to prevent resource exhaustion and system outages.
Handling partial failures, service limits, and throttling is crucial to maintain data integrity and system availability.
Embracing failures and building systems that can gracefully recover is the key to developing resilient distributed applications.