AWS re:Invent 2025 - Drive operational excellence for modern applications (COP327)

Driving Operational Excellence for Modern Applications

Challenges with Modern Applications

  • Monolithic applications were hard to scale, had tightly coupled dependencies, and made it difficult for teams to work independently
  • Modern applications built on microservices, serverless, and short-lived resources provide more agility and flexibility
  • However, the distributed nature of modern applications makes them more complex to observe and troubleshoot

Must-Haves for Operational Excellence

Metrics

  • Start with business metrics that indicate application performance and its impact on the business
  • Use "golden signal" or RED (requests, errors, duration) metrics, together with availability, to measure application health
  • Add metadata (dimensions/labels) to provide context around metrics
  • Leverage tags to query vendor-provided metrics
  • Continuously improve by reducing detection and resolution time through a "Correction of Error" (COE) process
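
As a rough illustration of the request, error, and duration metrics above, the following Python sketch groups raw request records by two dimensions and computes per-group values. The `Request` record, the dimension names (`service`, `operation`), the choice to count only 5xx responses as errors, and the sample data are all assumptions made for this example; they are not taken from the talk.

```python
from collections import defaultdict
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class Request:
    service: str        # dimension: which service handled the request (hypothetical)
    operation: str      # dimension: which API operation was called (hypothetical)
    status: int         # HTTP status code
    duration_ms: float  # end-to-end latency in milliseconds


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def red_metrics(requests):
    """Group requests by (service, operation) and compute RED metrics per group."""
    groups = defaultdict(list)
    for r in requests:
        groups[(r.service, r.operation)].append(r)

    result = {}
    for dims, reqs in groups.items():
        # Assumption: only server-side failures (5xx) count as errors.
        errors = sum(1 for r in reqs if r.status >= 500)
        result[dims] = {
            "requests": len(reqs),
            "errors": errors,
            "availability_pct": 100.0 * (len(reqs) - errors) / len(reqs),
            "p50_ms": percentile([r.duration_ms for r in reqs], 50),
        }
    return result


sample = [
    Request("checkout", "PlaceOrder", 200, 120.0),
    Request("checkout", "PlaceOrder", 500, 950.0),
    Request("checkout", "PlaceOrder", 200, 110.0),
    Request("checkout", "PlaceOrder", 200, 130.0),
    Request("catalog", "GetItem", 200, 35.0),
]
metrics = red_metrics(sample)
print(metrics[("checkout", "PlaceOrder")])
```

Keeping the dimensions in the grouping key is what lets a single failing operation stand out instead of being averaged away across the whole application.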

Alarms and Notifications

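  • Use CloudWatch Metrics Insights queries to create one alarm that covers many metrics, with the query controlling the scope and the data
  • Aggregate the metrics into a single series to receive one notification, or keep the series independent to be notified separately for each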
  • Include alarm descriptions with relevant information and links
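
The choice between one aggregated series and independent series can be sketched as a small Python helper that builds the alarm definition. This is only a sketch under assumptions: the helper name `build_alarm_request`, the threshold, the period, and the runbook URL are hypothetical, and the result is merely shaped like the keyword arguments of a CloudWatch `PutMetricAlarm` request. The sketch does not call AWS, so field names and query support should be checked against the current API documentation before use.

```python
def build_alarm_request(namespace, metric, dimension, threshold,
                        per_resource, runbook_url):
    """Build a dict shaped like a PutMetricAlarm request (not sent anywhere)."""
    # A Metrics Insights style query; without GROUP BY it yields a single series.
    query = f'SELECT MAX({metric}) FROM SCHEMA("{namespace}", {dimension})'
    if per_resource:
        # With GROUP BY, each resource keeps its own series.
        query += f" GROUP BY {dimension}"

    scope = f"each {dimension}" if per_resource else "the whole fleet"
    suffix = "per-resource" if per_resource else "aggregate"
    return {
        "AlarmName": f"{metric}-high-{suffix}",
        # The description carries context and a link for whoever is paged.
        "AlarmDescription": (
            f"{metric} above {threshold} for {scope}. "
            f"Runbook: {runbook_url}"
        ),
        "ComparisonOperator": "GreaterThanThreshold",
        "EvaluationPeriods": 3,
        "Threshold": threshold,
        "Metrics": [
            {"Id": "q1", "Expression": query, "Period": 60, "ReturnData": True}
        ],
    }


# Hypothetical runbook location, used only for illustration.
RUNBOOK = "https://example.com/runbooks/high-cpu"

aggregate = build_alarm_request(
    "AWS/EC2", "CPUUtilization", "InstanceId", 80.0, False, RUNBOOK)
independent = build_alarm_request(
    "AWS/EC2", "CPUUtilization", "InstanceId", 80.0, True, RUNBOOK)

print(aggregate["Metrics"][0]["Expression"])
print(independent["Metrics"][0]["Expression"])
```

The only difference between the two definitions is the `GROUP BY` clause, which is what decides whether the team receives one notification or one per resource.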

Distributed Tracing

  • Tracing is essential for understanding dependencies and identifying issues in distributed applications
  • Use open standards like OpenTelemetry for vendor-neutral, interoperable tracing
  • Leverage auto-instrumentation to avoid code changes
  • Add manual instrumentation for business-specific context
  • Correlate traces with logs and other telemetry
  • Control trace volume through sampling techniques like head sampling or tail sampling
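
The difference between head and tail sampling can be sketched in a few lines of Python. This is a simplified illustration, not the OpenTelemetry SDK or collector implementation: the `Span` record, the function names, and the 1000 ms "slow" threshold are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Span:
    trace_id: str        # 32 hex characters, shared by all spans in a trace
    name: str
    duration_ms: float
    error: bool = False


def head_sample(trace_id: str, ratio: float) -> bool:
    """Head sampling: decide when the trace starts, using only its ID.

    The decision is deterministic, so every service that sees the same
    trace ID reaches the same answer without coordination.
    """
    bucket = int(trace_id[-16:], 16)  # lower 64 bits of the trace ID
    return bucket < ratio * 2**64


def tail_sample(spans, slow_ms: float = 1000.0) -> bool:
    """Tail sampling: decide after the trace is complete.

    Keeps any trace that contains an error or a slow span, which a
    head sampler cannot know in advance.
    """
    return any(s.error or s.duration_ms >= slow_ms for s in spans)


low_id = "0" * 31 + "1"
high_id = "f" * 32

healthy = [Span(low_id, "GET /cart", 40.0), Span(low_id, "db.query", 12.0)]
failing = [Span(high_id, "POST /order", 80.0),
           Span(high_id, "payment.charge", 60.0, error=True)]

print(head_sample(low_id, 0.5), head_sample(high_id, 0.5))
print(tail_sample(healthy), tail_sample(failing))
```

In this sketch the failing trace is dropped by the head sampler at a 50% ratio but kept by the tail sampler. The trade-off is that tail sampling has to buffer complete traces before it can decide.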

Filling Gaps in Telemetry

  • Utilize specialized tools like:
    • Synthetic canaries for customer-facing visibility
    • Container Insights for container-based workloads
    • Network Flow Logs for network performance
  • Leverage AI and machine learning capabilities:
    • Anomaly detection for logs and metrics
    • CloudWatch investigations for automated root cause analysis
    • Amazon Managed Service for Prometheus (AMP) for AI-powered insights
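
To give a sense of what metric anomaly detection does, the Python sketch below flags points that fall outside a band built from recent history. It is a deliberately simple rolling mean and standard deviation check, not the algorithm used by any AWS service; the function name, the window size, the band width `k`, and the sample latencies are assumptions for illustration.

```python
from statistics import mean, pstdev


def find_anomalies(series, window=5, k=3.0):
    """Return indexes whose value lies more than k standard deviations
    from the mean of the preceding `window` points."""
    anomalies = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu = mean(history)
        band = k * pstdev(history)
        if abs(series[i] - mu) > band:
            anomalies.append(i)
    return anomalies


# Hypothetical latency samples in milliseconds, with one obvious spike.
latency_ms = [100, 102, 98, 101, 99, 100, 103, 250, 101, 100]
spikes = find_anomalies(latency_ms)
print(spikes)
```

A band derived from the data adapts to each metric's normal range, which is the main advantage over a hand-picked static threshold.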

Key Takeaways

  • Start with business-relevant metrics and add context through tags and dimensions
  • Implement distributed tracing using OpenTelemetry for visibility into modern, distributed applications
  • Continuously improve observability by reducing detection and resolution time
  • Fill gaps in telemetry using specialized tools and AI/ML capabilities
  • Focus on operational excellence to ensure modern applications meet business requirements

Resources

  • Observability workshop for hands-on activities
  • Observability best practices guide
  • Kiosks in the expo village for further discussions
