AWS re:Invent 2025 - How Netflix Shapes our Fleet for Efficiency and Reliability (IND387)

Balancing Efficiency and Reliability at Netflix

Understanding Efficiency and Reliability

Efficiency is about maximizing the business value of workloads while minimizing the cost of running them.

Reliability is about ensuring predictable performance, infrequent failures, and fast recovery when failures do occur.

Efficiency and reliability are complementary: Netflix treats them as a single construct rather than as separate goals.

Key efficiency metrics include cost, resource usage, and the risk-adjusted net value of workloads.

Key reliability metrics include mean time between failures, mean time to recovery, and the blast radius of failures.
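As a rough illustration of how these reliability metrics combine, the sketch below uses the standard steady-state availability formula, MTBF / (MTBF + MTTR). It assumes MTBF is measured as mean uptime between failures. The function names and the idea of weighting downtime by blast radius are illustrative assumptions, not something the talk specifies.

```python
def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time a service is up, assuming MTBF is mean uptime between failures."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("mtbf_hours must be positive and mttr_hours non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def expected_user_impact(availability: float, blast_radius_fraction: float) -> float:
    """Hypothetical combined metric: share of time down, scaled by the share of users a failure reaches."""
    if not 0.0 <= blast_radius_fraction <= 1.0:
        raise ValueError("blast_radius_fraction must be between 0 and 1")
    return (1.0 - availability) * blast_radius_fraction


if __name__ == "__main__":
    a = steady_state_availability(mtbf_hours=999.0, mttr_hours=1.0)
    print(f"availability: {a:.4f}")
    print(f"user impact with a 10% blast radius: {expected_user_impact(a, 0.10):.6f}")
```

Under this framing, lengthening MTBF, shortening MTTR, and shrinking the blast radius all reduce the same impact figure.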

Managing the Supply of Compute Resources

Capacity planning is critical to allocate compute resources efficiently across Netflix's fleet.

Netflix models stateful workloads (e.g. databases) and stateless workloads (e.g. microservices) differently for capacity planning.

Choosing the right mix of reserved and on-demand capacity is key to balancing efficiency and cost.
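One way to reason about the reserved versus on-demand trade-off is to search for the reservation level that minimizes total cost over a demand profile. The sketch below is a toy model with invented rates and a made-up hourly demand series. It assumes reserved capacity is paid for every hour whether used or not, and that any demand above the reservation is served on demand.

```python
from typing import Sequence


def best_reserved_level(
    hourly_demand: Sequence[int],
    reserved_rate: float,
    on_demand_rate: float,
) -> int:
    """Return the number of reserved instances that minimizes total cost (brute-force search)."""
    if not hourly_demand:
        raise ValueError("hourly_demand must not be empty")

    def total_cost(level: int) -> float:
        # Reserved instances are billed for every hour in the profile.
        reserved = level * reserved_rate * len(hourly_demand)
        # Demand above the reserved level spills over to on-demand pricing.
        overflow_hours = sum(max(0, d - level) for d in hourly_demand)
        return reserved + overflow_hours * on_demand_rate

    return min(range(0, max(hourly_demand) + 1), key=total_cost)


if __name__ == "__main__":
    demand = [10, 10, 20, 40]  # hypothetical instances needed in each hour
    print(best_reserved_level(demand, reserved_rate=0.6, on_demand_rate=1.0))
```

In this toy profile, reserving for the baseline and buying the peak on demand is cheapest. A spikier or flatter profile shifts that balance.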

Monitoring hardware availability and characteristics (e.g. generation, shape) is crucial to plan for volatile capacity.

Netflix uses the concepts of "success buffer" and "failure buffer" to model headroom and plan capacity.

Buffers vary by workload type and hardware, and Netflix tailors them for critical versus non-critical services.
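The summary does not define the two buffers precisely, so the following is only a sketch under an assumed interpretation. It treats the success buffer as headroom for demand exceeding the forecast and the failure buffer as headroom for absorbing lost capacity elsewhere. The percentages, tier names, and the `BufferPolicy` type are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class BufferPolicy:
    success_buffer_pct: int  # assumed meaning: headroom for traffic above forecast
    failure_buffer_pct: int  # assumed meaning: headroom for absorbing lost capacity


# Hypothetical per-tier policies; real values would come from modeling and observation.
POLICIES = {
    "critical": BufferPolicy(success_buffer_pct=25, failure_buffer_pct=50),
    "non_critical": BufferPolicy(success_buffer_pct=10, failure_buffer_pct=0),
}


def instances_needed(peak_rps: float, rps_per_instance: float, policy: BufferPolicy) -> int:
    """Instances for forecast peak demand plus both buffers, rounded up."""
    if rps_per_instance <= 0:
        raise ValueError("rps_per_instance must be positive")
    base = peak_rps / rps_per_instance
    headroom_pct = 100 + policy.success_buffer_pct + policy.failure_buffer_pct
    return math.ceil(base * headroom_pct / 100)


if __name__ == "__main__":
    for tier, policy in POLICIES.items():
        print(tier, instances_needed(10_000, 100, policy))
```

Keeping the two buffers as separate parameters makes it possible to tune them independently per workload tier and per hardware type.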

Understanding Compute Demand

Profiling workloads to understand CPU, memory, and network resource needs is foundational.

Observing production workloads is crucial to validate models and understand real-world usage patterns.

Accounting for service startup times is important when planning for capacity.
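To see why startup time matters, consider a simple linear ramp. If traffic grows at a steady rate and a new instance takes some time to become useful, the fleet must already hold enough spare capacity to cover the growth that happens during that window. The sketch below assumes a constant ramp rate and uniform instances; all parameter names and numbers are illustrative.

```python
import math


def startup_headroom_instances(
    ramp_rps_per_second: float,
    startup_seconds: float,
    rps_per_instance: float,
) -> int:
    """Spare instances needed to absorb traffic growth while new instances are still starting."""
    if rps_per_instance <= 0:
        raise ValueError("rps_per_instance must be positive")
    # Traffic that arrives before freshly launched instances can serve it.
    uncovered_rps = ramp_rps_per_second * startup_seconds
    return math.ceil(uncovered_rps / rps_per_instance)


if __name__ == "__main__":
    # Same ramp, two hypothetical startup times.
    print(startup_headroom_instances(50, 180, 100))  # slow-starting service
    print(startup_headroom_instances(50, 30, 100))   # fast-starting service
```

Under these assumptions the required headroom scales linearly with startup time, so a faster-starting service can run with a smaller standing buffer.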

Netflix's microservices architecture creates complex, non-linear call patterns that impact demand estimation.

Predictable daily and weekly traffic patterns are punctuated by unpredictable spikes during major events.

Balancing Supply and Demand

Fleet shaping: matching workloads to optimal hardware based on performance, cost, and capacity constraints.
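A minimal sketch of the idea behind fleet shaping is a greedy heuristic: fill demand from the hardware option with the lowest cost per unit of throughput, then fall back to the next option when available capacity runs out. This is not Netflix's actual method. The hardware names, prices, and throughput figures are hypothetical, and a greedy pass is not guaranteed to find the cheapest possible plan.

```python
import math
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class HardwareOption:
    name: str
    hourly_cost: float       # cost per instance-hour
    rps_per_instance: float  # measured throughput of the workload on this hardware
    available: int           # instances obtainable right now


def shape_fleet(demand_rps: float, options: Iterable[HardwareOption]) -> Dict[str, int]:
    """Greedily assign demand to hardware, cheapest cost per request first."""
    plan: Dict[str, int] = {}
    remaining = demand_rps
    for opt in sorted(options, key=lambda o: o.hourly_cost / o.rps_per_instance):
        if remaining <= 0:
            break
        wanted = math.ceil(remaining / opt.rps_per_instance)
        count = min(wanted, opt.available)
        if count > 0:
            plan[opt.name] = count
            remaining -= count * opt.rps_per_instance
    if remaining > 0:
        raise RuntimeError("not enough available capacity to meet demand")
    return plan


if __name__ == "__main__":
    options = [
        HardwareOption("shape-a", hourly_cost=1.0, rps_per_instance=100, available=5),
        HardwareOption("shape-b", hourly_cost=1.5, rps_per_instance=200, available=3),
    ]
    print(shape_fleet(900, options))
```

Here the more expensive instance wins on cost per request, but its limited availability forces part of the workload onto the second option, which is the capacity-constraint side of the trade-off.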

Pre-scaling: proactively scaling the fleet in anticipation of traffic spikes to maintain efficiency and reliability.

Dynamic traffic shaping: redistributing existing and new traffic across regions to balance load.

Reactive auto-scaling: rapidly adding or removing capacity in response to demand changes.
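The talk summary does not describe the scaling policy itself. As a generic illustration, the sketch below uses a common proportional rule: scale the current size by the ratio of observed to target utilization, round up, and clamp to configured bounds. The function and parameter names are assumptions made for this example.

```python
import math


def desired_capacity(
    current_instances: int,
    observed_utilization: float,
    target_utilization: float,
    min_size: int,
    max_size: int,
) -> int:
    """Proportional scaling rule: resize so that utilization moves toward the target."""
    if target_utilization <= 0:
        raise ValueError("target_utilization must be positive")
    raw = math.ceil(current_instances * observed_utilization / target_utilization)
    # Clamp so a noisy metric cannot scale the group to zero or without limit.
    return max(min_size, min(max_size, raw))


if __name__ == "__main__":
    print(desired_capacity(10, 0.75, 0.5, min_size=2, max_size=100))  # scale out
    print(desired_capacity(10, 0.25, 0.5, min_size=2, max_size=100))  # scale in
```

Because new instances take time to start, a reactive rule like this lags demand, which is why it is paired with proactive techniques such as pre-scaling.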

Prioritized load shedding: intelligently shedding non-critical traffic to protect core functionality.
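Prioritized load shedding can be sketched as a set of utilization thresholds, one per priority tier, where the least important traffic is rejected first and the most critical tier is never shed. The tier names and threshold values below are hypothetical and chosen only to illustrate the ordering.

```python
from enum import IntEnum
from typing import Dict


class Priority(IntEnum):
    CRITICAL = 0     # core functionality; never shed in this sketch
    DEGRADED = 1     # noticeable but tolerable if dropped
    BEST_EFFORT = 2
    BULK = 3         # background or prefetch traffic


# Hypothetical utilization levels above which each tier is rejected.
SHED_ABOVE: Dict[Priority, float] = {
    Priority.BULK: 0.70,
    Priority.BEST_EFFORT: 0.80,
    Priority.DEGRADED: 0.90,
}


def should_shed(priority: Priority, utilization: float) -> bool:
    """Return True if a request of this priority should be rejected at the given utilization."""
    threshold = SHED_ABOVE.get(priority)
    return threshold is not None and utilization > threshold


if __name__ == "__main__":
    for p in Priority:
        print(p.name, should_shed(p, 0.85))
```

As utilization climbs, progressively more important tiers are dropped, so the remaining capacity is reserved for the requests that matter most.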

Key Takeaways

Efficiency and reliability must be managed holistically, not as separate goals.

Detailed modeling and observation of both supply and demand are critical.

Proactive and reactive techniques are needed to balance efficiency and reliability.

Compound efficiency wins are possible at Netflix's scale.

End-to-end traffic management and "math as a safety blanket" are key principles.
