AWS re:Invent 2025 - Drive operational excellence for modern applications (COP327)

Driving Operational Excellence for Modern Applications

Challenges with Modern Applications

Monolithic applications were hard to scale, had tightly coupled dependencies, and made it difficult for teams to work independently

Modern applications built on microservices, serverless, and short-lived resources provide more agility and flexibility

However, the distributed nature of modern applications makes them more complex to observe and troubleshoot

Must-Haves for Operational Excellence

Metrics

Start with business metrics that indicate application performance and impact to the business

Use "golden" or RED (rate, errors, duration) metrics such as availability, request rate, errors, and duration to measure application health

Add metadata (dimensions/labels) to provide context around metrics

Leverage tags to query vendor-provided metrics

Continuously improve by reducing detection and resolution time through a "correction of errors" (COE) process
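The metric guidance above can be sketched in a few lines. The example below is a minimal illustration, not something shown in the talk: it summarizes requests, errors, and duration for one interval and wraps the result in a CloudWatch Embedded Metric Format (EMF) log line so that the `Service` and `Operation` dimensions travel with the values. The `Request` record, the `Shop/Checkout` namespace, and the dimension names are assumptions chosen for illustration.

```python
import json
from dataclasses import dataclass


@dataclass
class Request:
    """One handled request (hypothetical record shape)."""
    duration_ms: float
    is_error: bool


def red_summary(requests: list[Request]) -> dict:
    """Summarize request count, error count, and mean duration for one interval."""
    count = len(requests)
    errors = sum(1 for r in requests if r.is_error)
    mean_ms = sum(r.duration_ms for r in requests) / count if count else 0.0
    return {"Requests": count, "Errors": errors, "Duration": mean_ms}


def to_emf(summary: dict, service: str, operation: str, timestamp_ms: int) -> str:
    """Wrap the summary in an EMF log line carrying two dimensions."""
    record = {
        "_aws": {
            "Timestamp": timestamp_ms,
            "CloudWatchMetrics": [{
                "Namespace": "Shop/Checkout",  # hypothetical namespace
                "Dimensions": [["Service", "Operation"]],
                "Metrics": [
                    {"Name": "Requests", "Unit": "Count"},
                    {"Name": "Errors", "Unit": "Count"},
                    {"Name": "Duration", "Unit": "Milliseconds"},
                ],
            }],
        },
        # Dimension values and metric values sit at the top level of the record.
        "Service": service,
        "Operation": operation,
        **summary,
    }
    return json.dumps(record)
```

Writing the returned line to a log stream that CloudWatch ingests is the assumed next step; the sketch stops at building the record so it can run without AWS credentials.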

Alarms and Notifications

Use CloudWatch Metrics Insights queries to create one alarm that covers many metrics, with the query controlling which metrics are in scope

Aggregate the metrics into a single series to get one notification, or keep the series independent to be notified for each one

Include alarm descriptions with relevant information and links
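As a rough sketch of the alarm guidance above, the function below builds keyword arguments shaped like a CloudWatch `PutMetricAlarm` request whose single metric is a Metrics Insights query. Without `GROUP BY` the query yields one aggregated series; with `GROUP BY InstanceId` it yields one series per instance. The alarm name, threshold, and runbook URL are placeholders, and how notifications are delivered for a grouped query should be confirmed against the CloudWatch documentation.

```python
def build_alarm(per_instance: bool) -> dict:
    """Build keyword arguments shaped like a PutMetricAlarm request."""
    query = 'SELECT MAX(CPUUtilization) FROM SCHEMA("AWS/EC2", InstanceId)'
    if per_instance:
        # GROUP BY keeps one series per instance instead of a single aggregate.
        query += " GROUP BY InstanceId"
    return {
        "AlarmName": "fleet-cpu-high",  # hypothetical name
        "AlarmDescription": (
            "CPU above 80% across the fleet. "
            "Runbook: https://example.com/runbooks/cpu-high"  # placeholder link
        ),
        "ComparisonOperator": "GreaterThanThreshold",
        "Threshold": 80.0,
        "EvaluationPeriods": 3,
        "Metrics": [{
            "Id": "q1",
            "Expression": query,
            "Period": 60,
            "ReturnData": True,
        }],
    }
```

With boto3 installed and credentials configured, the dictionary could be passed as `cloudwatch.put_metric_alarm(**build_alarm(per_instance=False))`; that call is left out so the sketch runs on its own.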

Distributed Tracing

Tracing is essential for understanding dependencies and identifying issues in distributed applications

Use open standards like OpenTelemetry for vendor-neutral, interoperable tracing

Leverage auto-instrumentation to avoid code changes

Add manual instrumentation for business-specific context

Correlate traces with logs and other telemetry

Control trace volume through sampling techniques like head sampling or tail sampling
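The difference between the two sampling techniques named above can be illustrated with a small, library-free sketch. Head sampling decides when a trace starts, using only the trace ID, so every service reaches the same answer. Tail sampling decides after the trace finishes, so it can keep every error and slow request. This is not the OpenTelemetry SDK's or Collector's implementation; the `FinishedTrace` shape and the thresholds are assumptions for illustration.

```python
import hashlib
from dataclasses import dataclass


def head_sample(trace_id: str, ratio: float) -> bool:
    """Decide at trace start from the trace ID alone, so all services agree."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform in [0, 2**64)
    return bucket < ratio * 2**64


@dataclass
class FinishedTrace:
    """Summary of a completed trace (hypothetical shape)."""
    trace_id: str
    duration_ms: float
    has_error: bool


def tail_sample(trace: FinishedTrace, slow_ms: float, baseline_ratio: float) -> bool:
    """Decide after completion: keep errors and slow traces, sample the rest."""
    if trace.has_error or trace.duration_ms >= slow_ms:
        return True
    return head_sample(trace.trace_id, baseline_ratio)
```

The trade-off is cost versus signal: head sampling is cheap because unsampled traces are never recorded, while tail sampling has to buffer complete traces before deciding but does not discard the interesting ones.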

Filling Gaps in Telemetry

Utilize specialized tools like:

Synthetic canaries for customer-facing visibility
Container Insights for container-based workloads
Network Flow Logs for network performance
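To make the synthetic canary idea concrete, here is a minimal sketch of what a single canary run checks: it exercises an endpoint the way a customer would and records whether the response was successful and fast enough. It does not use the CloudWatch Synthetics runtime. The `probe` callable, which is assumed to return an HTTP status code, and the latency budget are hypothetical.

```python
import time
from typing import Callable, Optional


def run_canary(probe: Callable[[], int], max_latency_ms: float) -> dict:
    """Call the probe once and report status, latency, and pass/fail."""
    start = time.perf_counter()
    status: Optional[int]
    try:
        status = probe()
    except Exception:
        # A probe that raises (timeout, DNS failure, ...) counts as a failed run.
        status = None
    latency_ms = (time.perf_counter() - start) * 1000
    passed = status == 200 and latency_ms <= max_latency_ms
    return {"status": status, "latency_ms": latency_ms, "passed": passed}
```

In practice the probe would wrap a real HTTP request and the function would run on a schedule, with the `passed` flag published as a metric that an alarm can watch.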

Leverage AI and machine learning capabilities:

Anomaly detection for logs and metrics
CloudWatch investigations for automated root cause analysis
Amazon Managed Service for Prometheus (AMP) for AI-powered insights
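Anomaly detection for metrics generally means learning an expected band from recent history and flagging points that fall outside it. The sketch below uses a simple rolling mean and standard deviation to show the idea. It is a deliberately basic stand-in, not the model CloudWatch uses, and the `window` and `k` parameters are assumptions for illustration.

```python
from statistics import mean, stdev


def find_anomalies(values: list[float], window: int, k: float) -> list[int]:
    """Return indexes whose value lies more than k standard deviations
    from the mean of the preceding `window` values."""
    if window < 2:
        raise ValueError("window must be at least 2 to compute a deviation")
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        center, spread = mean(history), stdev(history)
        if abs(values[i] - center) > k * spread:
            anomalies.append(i)
    return anomalies
```

For example, `find_anomalies([10, 11, 10, 11, 10, 50, 10, 11], window=5, k=3)` flags only index 5, the spike to 50. The points after the spike are not flagged because the spike itself widens the band for the windows that contain it.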

Key Takeaways

Start with business-relevant metrics and add context through tags and dimensions

Implement distributed tracing using OpenTelemetry for visibility into modern, distributed applications

Continuously improve observability by reducing detection and resolution time

Fill gaps in telemetry using specialized tools and AI/ML capabilities

Focus on operational excellence to ensure modern applications meet business requirements

Resources

Observability workshop for hands-on activities

Observability best practices guide

Kiosks in the expo village for further discussions
