AWS re:Invent 2025 - How Netflix Shapes our Fleet for Efficiency and Reliability (IND387)

Balancing Efficiency and Reliability at Netflix

Understanding Efficiency and Reliability

  • Efficiency is about maximizing the business value of workloads while minimizing the cost of running them.
  • Reliability is about ensuring predictable performance, infrequent failures, and fast recovery when failures do occur.
  • Efficiency and reliability are complementary - Netflix views them as a single construct, not separate goals.
  • Key efficiency metrics include cost, resource usage, and the risk-adjusted net value of workloads.
  • Key reliability metrics include mean time between failures, mean time to recovery, and the blast radius of failures.
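
As a rough illustration of the reliability metrics listed above, the sketch below derives mean time between failures (MTBF), mean time to recovery (MTTR), and steady-state availability from an incident log. The `Incident` type, the `reliability_summary` helper, and the sample numbers are hypothetical and are not taken from the talk; only the standard relation availability = MTBF / (MTBF + MTTR) is assumed.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    start_hour: float      # hours since the start of the observation window
    duration_hours: float  # time until service was restored


def reliability_summary(incidents, window_hours):
    """Return (mtbf, mttr, availability) for one observation window."""
    if not incidents:
        # No failures observed: MTBF is unbounded and availability is 100%.
        return float("inf"), 0.0, 1.0
    downtime = sum(i.duration_hours for i in incidents)
    uptime = window_hours - downtime
    mtbf = uptime / len(incidents)    # mean operating time between failures
    mttr = downtime / len(incidents)  # mean time to restore service
    availability = mtbf / (mtbf + mttr)
    return mtbf, mttr, availability


# Hypothetical 30-day window (720 hours) with two incidents.
incidents = [Incident(100.0, 0.5), Incident(400.0, 1.5)]
mtbf, mttr, availability = reliability_summary(incidents, window_hours=720.0)
print(f"MTBF={mtbf:.1f}h MTTR={mttr:.1f}h availability={availability:.4%}")
```

Blast radius is not modeled here; it would need per-incident data on how many users or services were affected.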

Managing the Supply of Compute Resources

  • Capacity planning is critical for allocating compute resources efficiently across Netflix's fleet.
  • Netflix models stateful workloads (e.g., databases) and stateless workloads (e.g., microservices) differently for capacity planning.
  • Choosing the right mix of reserved and on-demand capacity is key to balancing efficiency and cost.
  • Monitoring hardware availability and characteristics (e.g. generation, shape) is crucial to plan for volatile capacity.
  • Netflix uses the concepts of "success buffer" and "failure buffer" to model headroom and plan capacity.
  • Buffers vary by workload type and hardware - Netflix tailors them for critical vs. non-critical services.
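
The buffer idea above can be sketched as a small sizing calculation. This summary does not define the two buffers precisely, so the sketch assumes that the "success buffer" is fractional headroom for demand that exceeds the forecast and the "failure buffer" is fractional headroom for absorbing lost capacity, and that the two combine multiplicatively. The function name, the tier names, and every number are hypothetical.

```python
import math


def instances_needed(peak_rps, rps_per_instance, success_buffer, failure_buffer):
    """Size a cluster for forecast peak demand plus two headroom buffers.

    success_buffer: assumed headroom for demand above forecast (fraction).
    failure_buffer: assumed headroom for absorbing lost capacity (fraction).
    """
    target_rps = peak_rps * (1 + success_buffer) * (1 + failure_buffer)
    return math.ceil(target_rps / rps_per_instance)


# Hypothetical per-tier buffers: critical services carry more headroom.
BUFFERS = {
    "critical": (0.25, 0.5),
    "non_critical": (0.125, 0.0),
}

for tier, (success, failure) in BUFFERS.items():
    count = instances_needed(10_000, 500, success, failure)
    print(f"{tier}: {count} instances")
```

With no buffers the same workload needs 20 instances, so the gap between tiers shows how buffer choices translate directly into fleet cost.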

Understanding Compute Demand

  • Profiling workloads to understand CPU, memory, and network resource needs is foundational.
  • Observing production workloads is crucial to validate models and understand real-world usage patterns.
  • Accounting for service startup times is important when planning for capacity.
  • Netflix's microservices architecture creates complex, non-linear call patterns that impact demand estimation.
  • Predictable daily and weekly traffic patterns are punctuated by unpredictable spikes during major events.
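
To show why startup time matters for capacity, here is a minimal sketch under a simplifying assumption: traffic grows linearly during a ramp, and a newly requested instance serves no traffic until it finishes booting. Capacity that is already running must then cover all growth that occurs during one startup interval. The function and parameter names and the numbers are illustrative, not from the talk.

```python
import math


def startup_headroom_instances(growth_rps_per_sec, startup_seconds, rps_per_instance):
    """Instances that must already be running to absorb traffic growth
    during the time a newly launched instance takes to become ready."""
    uncovered_rps = growth_rps_per_sec * startup_seconds
    return math.ceil(uncovered_rps / rps_per_instance)


# Hypothetical ramp: +50 requests/s every second, 3-minute startup.
headroom = startup_headroom_instances(
    growth_rps_per_sec=50, startup_seconds=180, rps_per_instance=500
)
print(headroom)
```

Under this model, halving the startup time halves the headroom that has to sit idle ahead of a ramp.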

Balancing Supply and Demand

  • Fleet shaping - matching workloads to optimal hardware based on performance, cost, and capacity constraints.
  • Pre-scaling - proactively scaling the fleet in anticipation of traffic spikes to maintain efficiency and reliability.
  • Dynamic traffic shaping - redistributing existing and new traffic across regions to balance load.
  • Reactive auto-scaling - rapidly adding or removing capacity in response to demand changes.
  • Prioritized load shedding - intelligently shedding non-critical traffic to protect core functionality.
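
Of the techniques above, prioritized load shedding is the simplest to sketch in code. The example below is one possible shape, not Netflix's implementation: each request carries a priority tier, and lower tiers are rejected first as utilization climbs. The tier names, the thresholds, and the `should_shed` helper are all hypothetical.

```python
from enum import IntEnum


class Priority(IntEnum):
    CRITICAL = 0     # core functionality; never shed in this sketch
    DEGRADED = 1     # requests with an acceptable fallback
    BEST_EFFORT = 2  # nice-to-have features
    BULK = 3         # background or batch traffic


# Utilization (0.0 to 1.0) at or above which each tier is shed.
# CRITICAL has no entry, so it is always admitted.
SHED_THRESHOLDS = {
    Priority.BULK: 0.70,
    Priority.BEST_EFFORT: 0.80,
    Priority.DEGRADED: 0.90,
}


def should_shed(priority, utilization):
    """Return True if a request of this priority should be rejected."""
    threshold = SHED_THRESHOLDS.get(priority)
    return threshold is not None and utilization >= threshold


# At 85% utilization the two lowest tiers are shed.
for p in Priority:
    print(p.name, "shed" if should_shed(p, 0.85) else "admit")
```

Staggered thresholds let the service degrade in steps rather than failing all traffic at once, which is the point of shedding by priority.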

Key Takeaways

  • Efficiency and reliability must be managed holistically, not as separate goals.
  • Detailed modeling and observation of both supply and demand are critical.
  • Proactive and reactive techniques are needed to balance efficiency and reliability.
  • Compound efficiency wins are possible at Netflix's scale.
  • End-to-end traffic management and "math as a safety blanket" are key principles.
