AWS re:Invent 2025 - Finding the Noisy Neighbor: Patterns for Per‑Customer Performance at Scale (MAM354)
Challenges of High Cardinality Infrastructure
Zendesk, an AWS partner, serves hundreds of thousands of customers from shared infrastructure, which creates high-cardinality challenges:
Monitoring and troubleshooting issues that affect only a small subset of customers on shared infrastructure
Balancing observability, cost, and data sensitivity
Managing the "teeter-totter" between cost and observability: adding visibility raises spend, and cutting spend reduces visibility
Defining the "Noisy Neighbor" Problem
The "noisy neighbor" analogy: one customer's activity degrades performance for other customers on the same shared infrastructure
Troubleshooting a customer incident with limited visibility into the underlying infrastructure
Improving Observability with Tagging and Tracing
Tagging infrastructure components and their telemetry (e.g., by tenant and microservice) to gain more visibility into a customer's issue
Using APM (Application Performance Monitoring) tools like Datadog to trace the entire call flow and identify potential bottlenecks
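The tagging idea above can be illustrated with a minimal, self-contained sketch. It is not Zendesk's implementation and not the Datadog API: the `TaggedMetrics` class, the metric name, and the tenant and service names are all hypothetical. It only shows why attaching tenant and service tags to every data point makes per-customer slicing possible later.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Tuple

TagSet = FrozenSet[Tuple[str, str]]


class TaggedMetrics:
    """Hypothetical in-memory metric store keyed by (metric name, tag set)."""

    def __init__(self) -> None:
        self._points: Dict[Tuple[str, TagSet], List[float]] = defaultdict(list)

    def record(self, name: str, value: float, **tags: str) -> None:
        # Every data point carries its tags, so it can be sliced later.
        self._points[(name, frozenset(tags.items()))].append(value)

    def query(self, name: str, **match: str) -> List[float]:
        # Return all values whose tags include every requested key/value pair.
        wanted = set(match.items())
        out: List[float] = []
        for (metric, tags), values in self._points.items():
            if metric == name and wanted <= tags:
                out.extend(values)
        return out


metrics = TaggedMetrics()
# Tenant and service names below are made up for illustration.
metrics.record("request.latency_ms", 42.0, tenant="acme", service="tickets")
metrics.record("request.latency_ms", 910.0, tenant="globex", service="tickets")
metrics.record("request.latency_ms", 35.0, tenant="acme", service="search")

# Slice by one tenant across all services, or by one service across tenants.
acme_latencies = metrics.query("request.latency_ms", tenant="acme")
tickets_latencies = metrics.query("request.latency_ms", service="tickets")
```

Without the tenant tag, the 910 ms outlier would only show up as a worse aggregate for the tickets service, with no way to attribute it to a specific customer.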
The "REST" Approach to Observability
Recognize: Identify the key metrics, logs, and infrastructure monitoring needed to detect issues
Examine: Analyze the data to determine the root cause, such as hot partitions, backlogs, or latency spikes
Shape: Implement controls and limits to protect customers, such as rate limiting or adjusting resource allocations
Test: Regularly review the effectiveness of the observability setup and make iterative improvements
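As a rough sketch of the Examine and Shape steps, the code below first flags tenants that account for a disproportionate share of requests, then applies a per-tenant token bucket as one possible rate-limiting control. The talk summary does not say which algorithm was used; the token bucket, the 50% share threshold, and all class and function names here are assumptions chosen for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, List


def find_noisy_tenants(request_tenants: Iterable[str],
                       share_threshold: float = 0.5) -> List[str]:
    """Examine: return tenants whose share of requests meets the threshold."""
    counts = Counter(request_tenants)
    total = sum(counts.values())
    if total == 0:
        return []
    return sorted(t for t, n in counts.items() if n / total >= share_threshold)


class TokenBucket:
    """Shape: allow bursts up to `capacity`, refilled at `refill_per_sec`."""

    def __init__(self, capacity: float, refill_per_sec: float) -> None:
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = capacity
        self.last = 0.0

    def allow(self, now: float) -> bool:
        # Refill based on elapsed time, capped at capacity, then spend one token.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class TenantRateLimiter:
    """One bucket per tenant, so a noisy tenant only exhausts its own budget."""

    def __init__(self, capacity: float, refill_per_sec: float) -> None:
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self._buckets: Dict[str, TokenBucket] = {}

    def allow(self, tenant: str, now: float) -> bool:
        if tenant not in self._buckets:
            self._buckets[tenant] = TokenBucket(self.capacity, self.refill_per_sec)
        return self._buckets[tenant].allow(now)


# Hypothetical traffic: one tenant sends 8 of 10 requests.
noisy = find_noisy_tenants(["globex"] * 8 + ["acme"] * 2)

limiter = TenantRateLimiter(capacity=2, refill_per_sec=1.0)
globex_burst = [limiter.allow("globex", now=0.0) for _ in range(3)]
acme_first = limiter.allow("acme", now=0.0)
globex_later = limiter.allow("globex", now=1.0)
```

The clock is passed in explicitly so the behavior is deterministic; a real service would more likely read a monotonic clock inside `allow`. Keeping the buckets per tenant is the point of the design: throttling the noisy tenant leaves the quiet tenant's requests untouched.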
Optimizing Observability Costs
Reducing log ingestion and indexing by only capturing necessary data (e.g., errors, specific customer traces)
Leveraging less expensive observability tiers, such as APM and metrics, to gain visibility without the high cost of full log ingestion
Aligning observability costs with the value provided to the business and customers
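One way to picture "only capturing necessary data" is a filter placed in front of the log shipper. The sketch below uses Python's standard `logging` module; the `CostAwareFilter` class, the `tenant` record attribute, and the idea of a per-tenant allowlist are assumptions for illustration rather than the setup described in the talk.

```python
import logging
from typing import Iterable, List, Set


class CostAwareFilter(logging.Filter):
    """Keep errors from everyone, plus all records for tenants under investigation."""

    def __init__(self, traced_tenants: Iterable[str]) -> None:
        super().__init__()
        self.traced_tenants: Set[str] = set(traced_tenants)

    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno >= logging.ERROR:
            return True
        # `tenant` is a hypothetical attribute supplied through `extra=`.
        return getattr(record, "tenant", None) in self.traced_tenants


class ListHandler(logging.Handler):
    """Stand-in for a log shipper: collects the messages that would be ingested."""

    def __init__(self) -> None:
        super().__init__()
        self.messages: List[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(record.getMessage())


logger = logging.getLogger("ingest_demo")
logger.setLevel(logging.DEBUG)
logger.propagate = False

handler = ListHandler()
handler.addFilter(CostAwareFilter(traced_tenants={"acme"}))
logger.addHandler(handler)

logger.info("cache hit", extra={"tenant": "globex"})   # dropped: not an error, tenant not traced
logger.info("slow query", extra={"tenant": "acme"})    # kept: tenant is being traced
logger.error("timeout", extra={"tenant": "globex"})    # kept: errors are always ingested

ingested = handler.messages
```

The allowlist can be widened temporarily while a specific customer incident is being investigated and narrowed again afterward, which keeps ingestion volume tied to an actual need.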
Lessons Learned and Key Takeaways
Start with a tagging strategy before building dashboards
Invest early in per-tenant KPIs and heat maps to identify performance issues
Treat cost as a first-class dimension in observability decisions
Ensure proper metadata and context are captured to enable proactive issue resolution
Foster a company culture that is comfortable discussing and optimizing observability costs
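The per-tenant KPI and heat map takeaway can be sketched as a small aggregation: group latency samples by tenant and time bucket, then compute a percentile per cell. The sample format, the one-minute bucket, the choice of p95, and the nearest-rank percentile method are all assumptions made for this example; the talk summary does not specify them.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Sample = Tuple[float, str, float]  # (unix timestamp, tenant, latency in ms)


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


def latency_heatmap(samples: Iterable[Sample],
                    bucket_seconds: int = 60,
                    pct: float = 95.0) -> Dict[str, Dict[int, float]]:
    """Return {tenant: {time bucket index: percentile latency}}."""
    grouped: Dict[str, Dict[int, List[float]]] = defaultdict(lambda: defaultdict(list))
    for ts, tenant, latency_ms in samples:
        grouped[tenant][int(ts // bucket_seconds)].append(latency_ms)
    return {
        tenant: {bucket: percentile(vals, pct) for bucket, vals in sorted(buckets.items())}
        for tenant, buckets in grouped.items()
    }


# Hypothetical samples: two tenants over two one-minute buckets.
samples = [
    (0.0, "acme", 10.0),
    (30.0, "acme", 20.0),
    (61.0, "acme", 30.0),
    (5.0, "globex", 900.0),
]
heatmap = latency_heatmap(samples)
```

Each inner dictionary is one row of the heat map (a tenant) and each key is one column (a time bucket), so a single hot cell points at both the affected customer and the time window.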
Technical Details and Business Impact
Zendesk uses AWS services such as Aurora, ElastiCache, and EC2 to power its high-cardinality infrastructure
Improved response times and customer satisfaction by gaining better visibility into the infrastructure
Optimized cloud resource utilization and observability costs through iterative improvements
Real-World Examples and Results
Reduced log ingestion and indexing costs by only capturing necessary data (e.g., errors, specific customer traces)
Leveraged APM and metrics to gain visibility without the high cost of full log ingestion
Aligned observability costs with the value provided to the business and customers, avoiding budget conflicts