AWS re:Invent 2025 - Drive operational excellence for modern applications (COP327)
Driving Operational Excellence for Modern Applications
Challenges with Modern Applications
Monolithic applications were hard to scale, had tightly coupled dependencies, and made it difficult for teams to work independently
Modern applications built on microservices, serverless, and short-lived resources provide more agility and flexibility
However, the distributed nature of modern applications makes them more complex to observe and troubleshoot
Must-Haves for Operational Excellence
Metrics
Start with business metrics that indicate application performance and impact to the business
Use "golden signal" or RED (rate, errors, duration) metrics, along with availability, to measure application health
Add metadata (dimensions/labels) to provide context around metrics
Use resource tags to filter and query vendor-provided metrics, such as those AWS services publish automatically
Continuously improve by reducing detection and resolution time through a "correction of error" (COE) process
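As a rough illustration of adding dimensions to request, error, duration, and business metrics, the sketch below builds a single log record in the CloudWatch Embedded Metric Format (EMF). The namespace, dimension names, metric names, and values are hypothetical and were not given in the talk; only the Python standard library is used.

```python
import json
import time


def build_emf_record(namespace, dimensions, metrics, timestamp_ms=None):
    """Build one CloudWatch Embedded Metric Format (EMF) log record.

    dimensions: dict of dimension name -> value (context around the metrics)
    metrics:    dict of metric name -> (value, unit)
    """
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    record = {
        "_aws": {
            "Timestamp": timestamp_ms,
            "CloudWatchMetrics": [
                {
                    "Namespace": namespace,
                    # One dimension set: each metric is keyed by all of
                    # these dimensions together.
                    "Dimensions": [list(dimensions.keys())],
                    "Metrics": [
                        {"Name": name, "Unit": unit}
                        for name, (_, unit) in metrics.items()
                    ],
                }
            ],
        }
    }
    # Dimension values and metric values sit at the top level of the record.
    record.update(dimensions)
    record.update({name: value for name, (value, _) in metrics.items()})
    return record


record = build_emf_record(
    namespace="ExampleShop/Checkout",  # hypothetical namespace
    dimensions={"Service": "checkout", "Operation": "PlaceOrder"},
    metrics={
        "Requests": (1, "Count"),
        "Errors": (0, "Count"),
        "Duration": (182.5, "Milliseconds"),
        "OrdersPlaced": (1, "Count"),  # hypothetical business metric
    },
    timestamp_ms=1700000000000,
)
print(json.dumps(record))
```

Writing a record like this to a log stream that CloudWatch ingests is one way to get metrics and their context from the same line of output; other metric pipelines would need a different record shape.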
Alarms and Notifications
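Use CloudWatch Metrics Insights queries to create one alarm that covers many metrics, with the query controlling the scope and the data it evaluates
Aggregate the metrics into a single series to receive one notification, or keep the series independent to be notified about each one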
Include alarm descriptions with relevant information and links
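The alarm guidance above can be sketched as a set of parameters for CloudWatch's `PutMetricAlarm` API, where one Metrics Insights query aggregates many metrics into a single series. The alarm name, threshold, evaluation settings, and runbook URL are hypothetical; the sketch only builds the request and does not call AWS.

```python
def build_fleet_alarm(alarm_name, query, threshold, runbook_url):
    """Return keyword arguments for a PutMetricAlarm request.

    The query is a Metrics Insights expression; without GROUP BY it
    returns one aggregated series, so the alarm sends one notification.
    """
    return {
        "AlarmName": alarm_name,
        # Put context and links where the on-call engineer will see them.
        "AlarmDescription": (
            "Max Lambda duration across all functions exceeded "
            f"{threshold} ms. Runbook: {runbook_url}"
        ),
        "Metrics": [
            {
                "Id": "q1",
                "Expression": query,
                "Period": 60,
                "ReturnData": True,
            }
        ],
        "ComparisonOperator": "GreaterThanThreshold",
        "Threshold": threshold,
        "EvaluationPeriods": 3,
        "DatapointsToAlarm": 2,
        "TreatMissingData": "notBreaching",
    }


alarm = build_fleet_alarm(
    alarm_name="lambda-fleet-duration-high",  # hypothetical name
    query='SELECT MAX(Duration) FROM SCHEMA("AWS/Lambda", FunctionName)',
    threshold=3000.0,
    runbook_url="https://example.com/runbooks/lambda-duration",  # placeholder
)
print(alarm["AlarmDescription"])
```

With boto3 installed and credentials configured, a dictionary like this could be passed as `boto3.client("cloudwatch").put_metric_alarm(**alarm)`; new functions matching the query are picked up without editing the alarm.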
Distributed Tracing
Tracing is essential for understanding dependencies and identifying issues in distributed applications
Use open standards like OpenTelemetry for vendor-neutral, interoperable tracing
Leverage auto-instrumentation to avoid code changes
Add manual instrumentation for business-specific context
Correlate traces with logs and other telemetry
Control trace volume through sampling techniques like head sampling or tail sampling
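To make the difference between head and tail sampling concrete, here is a simplified sketch in plain Python. It is not the OpenTelemetry SDK or Collector implementation; the span dictionary fields (`trace_id`, `start_ms`, `end_ms`, `error`) and the thresholds are assumptions made for illustration.

```python
def head_sample(trace_id_hex, ratio):
    """Head sampling: decide when the trace starts, from the trace ID alone.

    The low 64 bits of the ID are treated as a uniform random value, so
    every service that sees the same ID reaches the same decision.
    """
    bucket = int(trace_id_hex[-16:], 16)
    return bucket < ratio * (1 << 64)


def tail_sample(spans, latency_threshold_ms, baseline_ratio):
    """Tail sampling: decide after the whole trace has been collected.

    Keep every trace that contains an error or is slow, and only a
    baseline fraction of the healthy ones.
    """
    has_error = any(span["error"] for span in spans)
    duration_ms = (
        max(span["end_ms"] for span in spans)
        - min(span["start_ms"] for span in spans)
    )
    if has_error or duration_ms >= latency_threshold_ms:
        return True
    return head_sample(spans[0]["trace_id"], baseline_ratio)


healthy = [
    {"trace_id": "f" * 32, "start_ms": 0, "end_ms": 40, "error": False},
    {"trace_id": "f" * 32, "start_ms": 5, "end_ms": 30, "error": False},
]
failed = [
    {"trace_id": "f" * 32, "start_ms": 0, "end_ms": 40, "error": True},
]

keep_healthy = tail_sample(healthy, latency_threshold_ms=500, baseline_ratio=0.1)
keep_failed = tail_sample(failed, latency_threshold_ms=500, baseline_ratio=0.1)
print(keep_healthy, keep_failed)
```

The trade-off is that head sampling is cheap but blind to the outcome, while tail sampling keeps the interesting traces at the cost of buffering every span until the trace completes.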
Filling Gaps in Telemetry
Utilize specialized tools like:
Synthetic canaries for customer-facing visibility
Container Insights for container-based workloads
Network Flow Logs for network performance
Leverage AI and machine learning capabilities:
Anomaly detection for logs and metrics
CloudWatch investigations for automated root cause analysis
Amazon Managed Service for Prometheus (AMP) for AI-powered insights
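The idea behind metric anomaly detection can be shown with a deliberately simple band: flag a point when it falls outside the mean of the recent window plus or minus a multiple of the standard deviation. This is an illustrative stand-in, not the model any managed service uses; the window size, multiplier, and sample latencies are assumptions.

```python
from statistics import mean, stdev


def find_anomalies(values, window=5, k=3.0):
    """Return indexes of points outside mean +/- k * stdev of the
    preceding `window` points."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        center = mean(history)
        band = k * stdev(history)
        if abs(values[i] - center) > band:
            anomalies.append(i)
    return anomalies


# Hypothetical p50 latency samples in milliseconds, one per minute.
latencies = [100, 102, 98, 101, 99, 100, 250, 101]
print(find_anomalies(latencies))
```

A band derived from recent history adapts to each metric's normal range, which is why it can replace hand-tuned static thresholds; production systems typically also account for trend and seasonality, which this sketch ignores.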
Key Takeaways
Start with business-relevant metrics and add context through tags and dimensions
Implement distributed tracing using OpenTelemetry for visibility into modern, distributed applications
Continuously improve observability by reducing detection and resolution time
Fill gaps in telemetry using specialized tools and AI/ML capabilities
Focus on operational excellence to ensure modern applications meet business requirements
Resources
Observability workshop for hands-on activities
Observability best practices guide
Kiosks in the expo village for further discussions