AWS re:Invent 2025 - Solving the Observability Mystery with AWS Step Functions (API321)
Building a Scalable Wind Speed Analysis Workflow
Demonstrated building a serverless workflow using AWS Step Functions to analyze a large dataset of global wind speed data
Leveraged the Distributed Map state to efficiently process over 600,000 objects stored in S3
Configured the Distributed Map with:
Batching of 500 objects per iteration
Concurrency limit of 1,000 parallel executions
Tolerated failure threshold of 5%
Output to a separate S3 bucket
Included Lambda functions to:
Analyze the wind speed data and calculate the mean
Consolidate and generate the final output
Convert the wind speed units from knots to miles per hour
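The configuration and Lambda logic above can be sketched as follows. This is a minimal illustration, not the workflow shown in the talk: the function names, bucket names, result prefix, the AnalyzeBatch state name, and the STANDARD execution type for child workflows are all assumptions, and the helper functions assume each reading is a plain number in knots.

```python
import math

# Exact conversion factor: one nautical mile (1852 m) over one statute mile (1609.344 m).
KNOTS_TO_MPH = 1852 / 1609.344


def build_distributed_map_state(source_bucket, result_bucket, function_arn):
    """Return a Distributed Map state (Amazon States Language) as a dict.

    Batch size, concurrency, and failure tolerance mirror the talk summary;
    every name passed in is a hypothetical placeholder.
    """
    return {
        "Type": "Map",
        "ItemReader": {
            "Resource": "arn:aws:states:::s3:listObjectsV2",
            "Parameters": {"Bucket": source_bucket},
        },
        "ItemBatcher": {"MaxItemsPerBatch": 500},
        "MaxConcurrency": 1000,
        "ToleratedFailurePercentage": 5,
        "ItemProcessor": {
            # ExecutionType is an assumption; the summary does not state it.
            "ProcessorConfig": {"Mode": "DISTRIBUTED", "ExecutionType": "STANDARD"},
            "StartAt": "AnalyzeBatch",
            "States": {
                "AnalyzeBatch": {
                    "Type": "Task",
                    "Resource": "arn:aws:states:::lambda:invoke",
                    "Parameters": {"FunctionName": function_arn, "Payload.$": "$"},
                    "End": True,
                }
            },
        },
        "ResultWriter": {
            "Resource": "arn:aws:states:::s3:putObject",
            "Parameters": {"Bucket": result_bucket, "Prefix": "results"},
        },
        "End": True,
    }


def mean_knots(readings):
    """Mean of a non-empty list of wind speed readings in knots."""
    if not readings:
        raise ValueError("no readings in batch")
    return sum(readings) / len(readings)


def knots_to_mph(knots):
    """Convert a speed from knots to miles per hour."""
    return knots * KNOTS_TO_MPH


# With 500 objects per batch, 600,000 objects need at least this many child executions.
MIN_CHILD_EXECUTIONS = math.ceil(600_000 / 500)
```

With a batch size of 500, the 600,000 objects translate into at least 1,200 child executions, which is why the concurrency limit of 1,000 matters for how long the map run takes to drain.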
Observability and Monitoring for Step Functions
Discussed the importance of observability when working with asynchronous, distributed workflows
Highlighted the new metrics recently launched by AWS for Step Functions:
Open Map Run Limit: The maximum number of concurrent map runs allowed
Open Map Run Count: The current number of open map runs
Map Run Backlog Size: The number of map runs waiting to be executed
Demonstrated monitoring these metrics using Amazon CloudWatch and setting alarms to proactively identify issues
Explained how the metrics can help identify when the Step Functions service is throttling the workflow due to hitting the state transition limit
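A CloudWatch alarm on the backlog metric could be set up roughly as below. This is a sketch under stated assumptions: the metric name and dimensions are passed in as parameters because the summary gives only display names ("Map Run Backlog Size"), so the exact CloudWatch metric name and its dimensions should be checked against the Step Functions metrics reference. The alarm name, period, and threshold are illustrative choices, not values from the talk.

```python
def backlog_alarm_params(metric_name, dimensions, threshold=1.0,
                         alarm_name="sfn-map-run-backlog"):
    """Build keyword arguments for CloudWatch put_metric_alarm.

    metric_name and dimensions are supplied by the caller because the exact
    values are an assumption here; AWS/States is the Step Functions namespace.
    """
    return {
        "AlarmName": alarm_name,
        "Namespace": "AWS/States",
        "MetricName": metric_name,
        "Dimensions": dimensions,
        "Statistic": "Maximum",
        "Period": 60,                # evaluate one-minute data points
        "EvaluationPeriods": 5,      # backlog must persist for five minutes
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "notBreaching",
    }


def create_alarm(params):
    """Create the alarm. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above works offline

    boto3.client("cloudwatch").put_metric_alarm(**params)


# Placeholder metric name and dimension, to be replaced with the documented values.
example_params = backlog_alarm_params(
    metric_name="MapRunBacklogSize",
    dimensions=[{"Name": "StateMachineArn",
                 "Value": "arn:aws:states:us-east-1:111111111111:stateMachine:WindSpeed"}],
)
```

Requiring the backlog to persist for several periods is one way to avoid alarming on short bursts while still catching sustained queuing of map runs.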
Debugging Cross-Account Integrations
Explored a scenario where a parent Step Functions workflow invokes a child workflow in a different AWS account
Highlighted the importance of establishing the correct trust relationship and IAM permissions between the accounts
Demonstrated how the parent workflow can get stuck waiting for the child workflow to complete when the "describe execution" permission (states:DescribeExecution) is missing
Explained the backup polling mechanism used by Step Functions to handle cases where events are not delivered, and how this can cause delays
Recommended adding the "describe execution" (states:DescribeExecution) and "stop execution" (states:StopExecution) permissions to the child workflow's IAM role so that the parent workflow can monitor and manage the child execution
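The permissions described above might look like the following. This is a hedged sketch, not the policies shown in the talk: it assumes the parent workflow calls the child through a role in the child account that it assumes, and all ARNs, role names, and function names are hypothetical.

```python
def child_account_role_policies(parent_role_arn, child_state_machine_arn):
    """Return (trust_policy, permissions_policy) for the role in the child account.

    Assumes a state machine ARN of the form
    arn:aws:states:REGION:ACCOUNT:stateMachine:NAME.
    """
    prefix, name = child_state_machine_arn.rsplit(":stateMachine:", 1)
    execution_arn_pattern = f"{prefix}:execution:{name}:*"

    trust_policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"AWS": parent_role_arn},
            "Action": "sts:AssumeRole",
        }],
    }
    permissions_policy = {
        "Version": "2012-10-17",
        "Statement": [
            {   # lets the parent start the child workflow
                "Effect": "Allow",
                "Action": ["states:StartExecution"],
                "Resource": child_state_machine_arn,
            },
            {   # lets the parent poll for completion and stop the child
                "Effect": "Allow",
                "Action": ["states:DescribeExecution", "states:StopExecution"],
                "Resource": execution_arn_pattern,
            },
        ],
    }
    return trust_policy, permissions_policy


def cross_account_task_state(child_state_machine_arn, target_role_arn):
    """Parent-side Task state that runs the child and waits for it to finish."""
    return {
        "Type": "Task",
        "Resource": "arn:aws:states:::states:startExecution.sync:2",
        "Parameters": {"StateMachineArn": child_state_machine_arn, "Input.$": "$"},
        "Credentials": {"RoleArn": target_role_arn},
        "End": True,
    }
```

Without the second statement, the child execution can still start, but the parent has no way to learn that it finished, which matches the stuck-parent symptom described in the talk.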
Key Takeaways
AWS Step Functions provides powerful capabilities for building scalable, observable serverless workflows
Monitoring the new Step Functions metrics, such as the open map run count and map run backlog size, is crucial for identifying throttling and other issues proactively
Careful planning of cross-account integrations, including IAM permissions and trust relationships, is essential to ensure smooth workflow execution and observability
Technical Details
AWS Step Functions
Distributed Map state
Lambda functions
Amazon S3
Amazon CloudWatch metrics and alarms
IAM roles and permissions
Business Impact
Enables the processing and analysis of large, distributed datasets in a scalable, serverless manner
Provides deep visibility into the execution of complex, asynchronous workflows to quickly identify and resolve issues
Facilitates seamless integration between different AWS services and accounts, unlocking new opportunities for collaboration and reuse
Examples
Wind speed data analysis workflow processing over 600,000 objects
Monitoring Step Functions metrics to identify and address state transition throttling
Troubleshooting a cross-account integration issue caused by missing IAM permissions