AWS re:Invent 2025-Finding the Noisy Neighbor: Patterns for Per‑Customer Performance at Scale-MAM354

Finding the Noisy Neighbor: Patterns for Per‑Customer Performance at Scale

Challenges of High Cardinality Infrastructure

  • Zendesk, an AWS partner, runs shared infrastructure serving hundreds of thousands of customers, which creates high-cardinality challenges:
    • Monitoring and troubleshooting issues that affect only a small subset of customers on shared infrastructure
    • Balancing observability, cost, and data sensitivity
    • Managing the "teeter-totter" between cost and observability

Defining the "Noisy Neighbor" Problem

  • The "noisy neighbor" analogy: one customer's activity degrades performance for other customers on the same shared infrastructure
  • Troubleshooting a customer incident with limited visibility into the underlying infrastructure

Improving Observability with Tagging and Tracing

  • Tagging infrastructure components (e.g., by tenant and microservice) to gain more visibility into the customer's issue
  • Using APM (Application Performance Monitoring) tools like Datadog to trace the entire call flow and identify potential bottlenecks
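
As a rough illustration of why tagging comes first, the sketch below records every measurement with tags so it can later be grouped per tenant. It is a minimal in-memory stand-in, not the API of Datadog or any other APM product; the class `TaggedMetrics`, the metric name `request.latency_ms`, and the tenant and service names are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class TaggedMetrics:
    """Hypothetical in-memory metrics store; every sample carries tags."""
    samples: dict = field(default_factory=lambda: defaultdict(list))

    def record(self, name, value, **tags):
        # Sort the tags so the same tag set always maps to the same series key.
        key = (name, tuple(sorted(tags.items())))
        self.samples[key].append(value)

    def by_tag(self, name, tag):
        """Group all samples of one metric by the value of a single tag."""
        grouped = defaultdict(list)
        for (metric, tags), values in self.samples.items():
            if metric == name:
                tag_value = dict(tags).get(tag, "untagged")
                grouped[tag_value].extend(values)
        return dict(grouped)


# Assumed example data: two tenants sharing the same services.
metrics = TaggedMetrics()
metrics.record("request.latency_ms", 42.0, tenant="acme", service="tickets")
metrics.record("request.latency_ms", 950.0, tenant="globex", service="tickets")
metrics.record("request.latency_ms", 38.0, tenant="acme", service="search")

per_tenant = metrics.by_tag("request.latency_ms", "tenant")
```

Because the tenant tag is attached when the sample is recorded, the same data can be sliced by `service` or by `tenant` without re-instrumenting the code. Samples recorded without a tag fall into an "untagged" group, which is the visibility gap that tagging is meant to close.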

The "REST" Approach to Observability

  1. Recognize: Identify the key metrics, logs, and infrastructure monitoring needed to detect issues
  2. Examine: Analyze the data to determine the root cause, such as hot partitions, backlogs, or latency spikes
  3. Shape: Implement controls and limits to protect customers, such as rate limiting or adjusting resource allocations
  4. Test: Regularly review the effectiveness of the observability setup and make iterative improvements
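
The Examine and Shape steps can be sketched as follows, under stated assumptions. The talk summary mentions rate limiting but not a specific algorithm, so the token bucket used here is a common choice picked for illustration only. The function `find_hot_tenants`, the class `TenantRateLimiter`, the 50% share threshold, and the tenant names are hypothetical.

```python
import time
from collections import Counter


def find_hot_tenants(request_tenants, share_threshold=0.5):
    """Examine: return tenants whose share of all requests exceeds the threshold."""
    counts = Counter(request_tenants)
    total = sum(counts.values())
    if total == 0:
        return []
    return sorted(t for t, n in counts.items() if n / total > share_threshold)


class TenantRateLimiter:
    """Shape: one token bucket per tenant, so one tenant cannot starve the rest."""

    def __init__(self, rate_per_sec, burst, clock=time.monotonic):
        self.rate = rate_per_sec
        self.burst = burst
        self.clock = clock          # injectable so the limiter can be tested
        self.buckets = {}           # tenant -> (tokens, last_refill_time)

    def allow(self, tenant):
        now = self.clock()
        tokens, last = self.buckets.get(tenant, (self.burst, now))
        # Refill in proportion to elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1:
            self.buckets[tenant] = (tokens - 1, now)
            return True
        self.buckets[tenant] = (tokens, now)
        return False


# Assumed traffic sample: one tenant sends 8 of 10 requests.
traffic = ["globex"] * 8 + ["acme"] * 2
hot = find_hot_tenants(traffic)

# A frozen clock makes the example deterministic: no refill happens.
limiter = TenantRateLimiter(rate_per_sec=1, burst=3, clock=lambda: 0.0)
decisions = [limiter.allow("globex") for _ in range(5)]
acme_ok = limiter.allow("acme")
```

With the clock frozen, the noisy tenant uses its burst of three requests and is then rejected, while the other tenant's bucket is untouched. Keeping the buckets separate per tenant is what isolates the neighbors from each other.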

Optimizing Observability Costs

  • Reducing log ingestion and indexing by only capturing necessary data (e.g., errors, specific customer traces)
  • Leveraging less expensive observability tiers, such as APM and metrics, to gain visibility without the high cost of full log ingestion
  • Aligning observability costs with the value provided to the business and customers
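
One way to capture only the necessary logs is a filter applied before records reach the ingestion pipeline. The sketch below uses Python's standard `logging` module; the class names `CostAwareFilter` and `ListHandler`, the logger name, and the tenant names are hypothetical, and the list-backed handler merely stands in for whatever shipper forwards logs to a vendor.

```python
import logging


class CostAwareFilter(logging.Filter):
    """Keep errors from everyone, plus all records for tenants under investigation."""

    def __init__(self, traced_tenants):
        super().__init__()
        self.traced_tenants = set(traced_tenants)

    def filter(self, record):
        if record.levelno >= logging.ERROR:
            return True
        # The tenant attribute is attached by callers via the `extra` argument.
        return getattr(record, "tenant", None) in self.traced_tenants


class ListHandler(logging.Handler):
    """Stand-in for a log shipper: collects whatever would be ingested."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


logger = logging.getLogger("ingest_demo")
logger.setLevel(logging.DEBUG)
logger.propagate = False

handler = ListHandler()
handler.addFilter(CostAwareFilter(traced_tenants={"globex"}))
logger.addHandler(handler)

logger.info("cache hit", extra={"tenant": "acme"})      # dropped: routine, untraced tenant
logger.info("slow query", extra={"tenant": "globex"})   # kept: tenant is being traced
logger.error("timeout", extra={"tenant": "acme"})       # kept: errors always pass

ingested = handler.messages
```

Only two of the three records reach the handler. The set of traced tenants can be widened during an incident and narrowed afterwards, which lets cost track the investigation rather than the total traffic volume.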

Lessons Learned and Key Takeaways

  1. Start with a tagging strategy before building dashboards
  2. Invest early in per-tenant KPIs and heat maps to identify performance issues
  3. Treat cost as a first-class dimension in observability decisions
  4. Ensure proper metadata and context are captured to enable proactive issue resolution
  5. Foster a company culture that is comfortable discussing and optimizing observability costs
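
As a sketch of what a per-tenant KPI feeding a heat map might look like, the code below computes a 95th-percentile latency for each tenant in each time bucket. The choice of p95, the 60-second bucket, the function names, and the sample data are all assumptions made for illustration; the summary does not say which KPIs the speakers used.

```python
import math
from collections import defaultdict


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def latency_heat_map(samples, bucket_seconds=60):
    """Turn (tenant, unix_ts, latency_ms) samples into {tenant: {bucket_start: p95}}."""
    cells = defaultdict(lambda: defaultdict(list))
    for tenant, ts, latency in samples:
        bucket = int(ts // bucket_seconds) * bucket_seconds
        cells[tenant][bucket].append(latency)
    return {
        tenant: {b: p95(v) for b, v in sorted(buckets.items())}
        for tenant, buckets in cells.items()
    }


# Assumed samples: tenant, seconds since start, latency in milliseconds.
samples = [
    ("acme", 0, 40), ("acme", 30, 50), ("acme", 70, 45),
    ("globex", 10, 900), ("globex", 65, 1200),
]
heat = latency_heat_map(samples)
```

Each row of the result is one tenant and each column one time bucket, which is the shape a heat-map widget typically expects. A tenant whose row stays hot while the others stay cool is a candidate noisy neighbor, or a candidate victim of one.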

Technical Details and Business Impact

  • Zendesk uses AWS services like Aurora, ElastiCache, and EC2 to power its high-cardinality infrastructure
  • Improved response times and customer satisfaction by gaining better visibility into the infrastructure
  • Optimized cloud resource utilization and observability costs through iterative improvements

Real-World Examples and Results

  • Reduced log ingestion and indexing costs by only capturing necessary data (e.g., errors, specific customer traces)
  • Leveraged APM and metrics to gain visibility without the high cost of full log ingestion
  • Aligned observability costs with the value provided to the business and customers, avoiding budget conflicts
