How Netflix handles sudden load spikes in the cloud (NFX301)


Introduction

  • Netflix has a global audience of over 600 million users and an ever-expanding catalog of content.
  • Managing traffic spikes is crucial to ensure a seamless viewing experience for users.
  • This presentation showcases how Netflix, in partnership with AWS, architects solutions to handle sudden load spikes.

The Problem

  • Netflix runs an active-active architecture across multiple AWS regions to support region failover and ensure resilience.
  • Historically, Netflix's traffic has followed gradual diurnal patterns, with roughly a 10x difference between trough and peak.
  • However, there are exceptions, such as sudden load spikes due to new title launches, external events, or internal service issues.
  • The complexity arises from Netflix's microservices architecture and the varying impact on different services in the call graph.
  • A key concept is "buffer": the headroom a service has to absorb a load spike while still returning successful responses and preserving the health of the system.
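A hypothetical way to picture this "buffer" idea (the function name and numbers below are illustrative assumptions, not Netflix's actual model): buffer is the headroom between a service's current load and the maximum it can serve successfully.

```python
def buffer_ratio(current_rps: float, max_sustainable_rps: float) -> float:
    """Headroom a service has before it starts failing requests.

    A buffer of 1.0 means the service could absorb a 2x spike;
    0.0 means it is already at capacity. (Illustrative model only.)
    """
    if current_rps <= 0:
        raise ValueError("current_rps must be positive")
    return (max_sustainable_rps - current_rps) / current_rps


# e.g. serving 40k RPS with capacity for 100k leaves a 1.5x buffer
print(buffer_ratio(40_000, 100_000))  # → 1.5
```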

Proactive Solutions

  1. Predictive Scaling: If a load spike is anticipated, such as a new title launch, Netflix scales up the fleet ahead of time to match the expected load.
  2. Traffic Shaping: Instead of scaling up one or two regions significantly, Netflix distributes the load across all regions to reduce risk and enable better autoscaling.
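The traffic-shaping idea above can be sketched as follows (region names and capacities are made up for illustration): instead of sending an anticipated spike to one or two regions, spread it across all regions in proportion to capacity, so each region's autoscaler absorbs a smaller increment.

```python
def shape_traffic(expected_rps: int, region_capacity: dict[str, int]) -> dict[str, int]:
    """Split an expected load spike across regions proportionally to capacity."""
    total = sum(region_capacity.values())
    return {region: round(expected_rps * cap / total)
            for region, cap in region_capacity.items()}


# Hypothetical capacities for three active-active regions
regions = {"us-east-1": 50_000, "us-west-2": 30_000, "eu-west-1": 20_000}
print(shape_traffic(10_000, regions))
# {'us-east-1': 5000, 'us-west-2': 3000, 'eu-west-1': 2000}
```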

Reactive Solutions

  1. Improved Autoscaling: Netflix has enhanced its autoscaling policies to be more responsive to sudden load spikes:
    • Scaling on requests per second (RPS) rather than CPU utilization, since RPS is a more direct signal of how much capacity is needed.
    • Implementing an "RPS Hammer" policy, a step-scaling approach to quickly add the right amount of capacity.
    • Using higher-resolution metrics (5-second resolution) to detect load increases faster.
  2. Prioritized Load Shedding:
    • Netflix establishes a criticality hierarchy for services, with different tiers of importance.
    • Within a service, requests are tagged with different priorities (critical, degraded, best-effort, bulk).
    • During a load spike, low-priority requests are shed first to preserve capacity for critical traffic.
  3. Cross-Region Retries:
    • If a service is throttling in one region, the request is retried in a different region, with a degraded priority to avoid cascading failures.
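The "RPS Hammer" step-scaling idea can be sketched like this (the function, thresholds, and numbers are illustrative assumptions, not Netflix's actual policy): rather than inching capacity up in fixed CPU-driven increments, compute the fleet size the observed RPS actually requires and jump there in one step.

```python
import math

def desired_instances(current_rps: float, rps_per_instance: float,
                      current_instances: int) -> int:
    """Step-scaling sketch: size the scale-up to the observed load in one
    step, instead of adding fixed increments based on CPU utilization."""
    needed = math.ceil(current_rps / rps_per_instance)
    # Never scale down as part of a spike response
    return max(needed, current_instances)


# A spike to 90k RPS, with ~1k RPS per instance, jumps straight to 90 instances
print(desired_instances(90_000, 1_000, 20))  # → 90
```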
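The prioritized-load-shedding tiers can be sketched as follows (the priority names come from the talk; the utilization thresholds are assumptions for illustration): as utilization rises, lower-priority tiers are shed first, and critical traffic is shed only as a last resort.

```python
# Priority tiers from highest to lowest, as described in the talk
PRIORITIES = ["critical", "degraded", "best-effort", "bulk"]

# Hypothetical utilization thresholds above which each tier is shed
SHED_THRESHOLDS = {"bulk": 0.60, "best-effort": 0.75, "degraded": 0.90, "critical": 1.0}

def should_shed(priority: str, utilization: float) -> bool:
    """Shed a request if current utilization exceeds its tier's threshold."""
    return utilization >= SHED_THRESHOLDS[priority]


# At 80% utilization, bulk and best-effort traffic is shed; critical passes
print([p for p in PRIORITIES if should_shed(p, 0.80)])  # → ['best-effort', 'bulk']
```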

Validation and Testing

  • Netflix uses a "Resilience Testing Pyramid" to ensure these mechanisms work at scale:
    1. Synthetic load testing on individual services.
    2. Production-based testing using real traffic.
    3. Region-scale tests by redirecting all global traffic to a single region.
    4. Region load testing by simulating user flows and failure scenarios.

Conclusion

  • Netflix has significantly reduced its time to recover from load spikes, relies on region failover less often, and has shifted its resilience posture to assume that load spikes are a constant.
  • Key takeaways:
    1. Combine proactive and reactive mechanisms to handle load spikes effectively.
    2. Prioritize requests and shed low-priority traffic to preserve capacity for critical workloads.
    3. Continuously test in production to validate the resilience of the system.
