Control observability data cost & complexity for Amazon EKS workloads (KUB101)
Managing Observability in EKS Environments
Introduction
The speaker, Jesse Rodriguez, is a Technical Account Manager at Chronosphere.
The talk focuses on managing the cost and complexity of observability in Kubernetes (EKS) environments.
Observability Challenges in Kubernetes
In the VM era, observability data patterns were more predictable and telemetry volumes were manageable.
The Kubernetes era presents new challenges:
Modern Kubernetes environments generate orders of magnitude more data, with a 10-100x increase in time series data.
Exponential growth in metric cardinality, plus a 250% year-over-year increase in log volume.
Impact on Cost and Productivity
Factors influencing observability costs:
Number of containers, high metric granularity, log verbosity, and retention policies.
Cardinality explosion, leading to exponential increase in query complexity and processing costs.
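To make the cardinality point concrete, here is a small illustrative calculation (the label names and counts are assumptions, not figures from the talk): the number of time series for one metric name is the product of the cardinalities of its labels, so pod churn multiplies everything else.

```python
# Hypothetical illustration: series count = product of label cardinalities.
from math import prod

# Assumed label cardinalities for a single container CPU metric.
label_cardinalities = {
    "namespace": 20,
    "pod": 500,        # pods churn, so this grows over time
    "container": 3,
    "node": 50,
}

series = prod(label_cardinalities.values())
print(series)  # 20 * 500 * 3 * 50 = 1,500,000 series for ONE metric name
```

Doubling any single label's cardinality doubles the total, which is why unbounded labels (pod IDs, request IDs) dominate observability bills.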
Impact on developer productivity:
87% of engineers say cloud-native architecture has increased the complexity of incident discovery and troubleshooting.
Engineers spend an average of 10 hours per week (25% of their work week) trying to triage and understand incidents.
88% report that the time spent on issues negatively impacts their careers, leading to burnout.
Strategies for Reducing Cost and Noise
Low-Hanging Fruit:
Metrics: Drop unnecessary metrics from tools like cAdvisor and kube-state-metrics.
Logs: Reroute seldom-used data to object storage and sample information-level logs.
Traces: Set global head and/or tail sampling to capture only interesting traces.
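The metric-dropping idea above can be sketched as a simple name filter applied before storage. This is a hedged illustration, not Chronosphere's actual implementation; the prefixes are real cAdvisor and kube-state-metrics metric names, but which ones count as "unnecessary" depends on your own dashboards and alerts.

```python
# Sketch: drop high-volume, seldom-queried metric families at ingest.
DROP_PREFIXES = (
    "container_network_tcp_usage_total",   # cAdvisor, often unused
    "container_memory_failures_total",     # cAdvisor
    "kube_pod_status_qos_class",           # kube-state-metrics
)

def keep(metric_name: str) -> bool:
    """Return True if the metric should be ingested and stored."""
    return not metric_name.startswith(DROP_PREFIXES)

scraped = [
    "container_cpu_usage_seconds_total",
    "container_network_tcp_usage_total",
    "kube_pod_status_qos_class",
]
filtered = [m for m in scraped if keep(m)]
print(filtered)  # only the CPU metric survives
```

In practice this filtering is usually expressed as Prometheus `metric_relabel_configs` or a collector processor rather than application code.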
Advanced Solutions:
Aggregation:
Remove unused metric dimensions.
Create rollup metrics that trade temporal resolution for cheaper long-term trending.
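A rollup can be sketched as follows, assuming raw samples arrive as (timestamp, value) pairs: average fine-grained samples into coarse windows, keeping the trend while storing far fewer points (the 5-minute window is an assumed choice, not one stated in the talk).

```python
# Sketch: roll 15-second samples up into 5-minute averages.
from collections import defaultdict

def rollup(samples: list[tuple[int, float]], window: int = 300) -> dict[int, float]:
    """Average (timestamp, value) samples into fixed windows."""
    buckets: dict[int, list[float]] = defaultdict(list)
    for ts, value in samples:
        buckets[ts - ts % window].append(value)   # align to window start
    return {ts: sum(vs) / len(vs) for ts, vs in buckets.items()}

raw = [(0, 1.0), (15, 3.0), (300, 5.0)]           # 15s-resolution samples
rolled = rollup(raw)
print(rolled)  # {0: 2.0, 300: 5.0} — two 5-minute points instead of three raw ones
```

The raw series can then be retained briefly for debugging while only the rollup is kept for long-term trending.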
Logs-to-Metrics:
Summarize logs into metrics without ingesting raw data.
Convert detailed error logs into error rate metrics.
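The logs-to-metrics conversion above can be sketched like this (the log format, regex, and counter name are illustrative assumptions): count matching lines into a counter metric and drop the raw bodies before they are ever stored.

```python
# Sketch: summarize error log lines into an error-count metric
# instead of ingesting the raw log data.
import re
from collections import Counter

ERROR_RE = re.compile(r"\blevel=error\b")

def logs_to_metrics(lines: list[str]) -> Counter:
    counts: Counter = Counter()
    for line in lines:
        if ERROR_RE.search(line):
            counts["app_errors_total"] += 1   # emit a counter, discard the line
    return counts

logs = [
    'ts=1 level=info msg="request ok"',
    'ts=2 level=error msg="db timeout"',
    'ts=3 level=error msg="db timeout"',
]
print(logs_to_metrics(logs))  # Counter({'app_errors_total': 2})
```

The resulting counter supports alerting on error rate at a fraction of the cost of indexing every log line.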
Tiered Sampling in Traces:
Capture a higher percentage of traces for revenue-critical paths.
Reduce sampling rates for lower-priority flows.
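Tiered sampling can be sketched as a per-route head-sampling decision (the routes and rates below are illustrative assumptions, not numbers from the talk): revenue-critical paths keep a high fraction of traces while low-priority flows keep very few.

```python
# Sketch of tiered head sampling with per-route rates.
import random

SAMPLE_RATES = {
    "/checkout": 0.50,    # revenue-critical: keep half of all traces
    "/healthz": 0.01,     # low-priority: keep 1%
}
DEFAULT_RATE = 0.10       # fallback for unlisted routes

def should_sample(route: str, rng: random.Random) -> bool:
    return rng.random() < SAMPLE_RATES.get(route, DEFAULT_RATE)

rng = random.Random(42)   # seeded for reproducibility
kept = sum(should_sample("/checkout", rng) for _ in range(10_000))
print(kept)               # roughly 5,000 of 10,000 checkout traces kept
```

Tail sampling works the same way but decides after the trace completes, so it can also key on outcome (errors, high latency) rather than route alone.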
Chronosphere's Approach
Chronosphere's observability platform and Telemetry pipeline provide control and optimization capabilities:
The control plane analyzes the value of data and allows teams to optimize before storage.
The Telemetry pipeline processes data in-flight to reduce, transform, and enrich logs.
Chronosphere has helped customers save 60% on observability costs and 30% on logging costs.
Case Study: Affirm
Affirm, a leading buy-now-pay-later company, faced challenges with observability during high-traffic events like Black Friday.
Chronosphere's aggregation and filtering capabilities allowed Affirm to control costs while maintaining high data quality for developers.
Chronosphere's platform demonstrated robust capabilities, with 99.9% availability, and enabled Affirm to increase data ingestion and achieve significant cost savings.