Scaling Prime Video for peak NFL streaming on AWS (ARC311)
Prime Video's Strategies for Scaling NFL Thursday Night Football on AWS
Challenges and Opportunities
The primary challenge was delivering a seamless streaming experience to millions of viewers despite the highly variable, spiky traffic patterns of NFL Thursday Night Football games.
The peak viewership surged from 10 million in 2022 to 18 million in 2024, putting tremendous pressure on Prime Video's infrastructure.
The key business opportunity was to scale infrastructure cost-effectively to meet this demand while maintaining high availability.
Multi-Region Architecture
Prime Video adopted a multi-region architecture to enhance reliability and resilience, building on three layers of capacity flexibility:
Instance type flexibility: Using a diverse set of instance types to increase the available capacity pool.
AZ flexibility: Leveraging multiple Availability Zones within a region for fault tolerance.
Multi-region flexibility: Extending the architecture across multiple AWS Regions to handle regional outages and optimize for latency.
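The effect of stacking these three layers can be illustrated with a minimal sketch: each combination of region, Availability Zone, and instance type is a separate capacity pool, so the number of pools grows multiplicatively. The function name, instance types, and zone names below are hypothetical examples chosen for illustration, not details from the talk.

```python
from typing import Dict, List, Tuple


def capacity_pools(
    instance_types: List[str],
    zones_by_region: Dict[str, List[str]],
) -> List[Tuple[str, str, str]]:
    """Enumerate every (region, zone, instance_type) combination.

    Each tuple stands for one independent pool that capacity
    could be drawn from.
    """
    return [
        (region, zone, instance_type)
        for region, zones in zones_by_region.items()
        for zone in zones
        for instance_type in instance_types
    ]


# Hypothetical inputs: 3 instance types, 2 regions with 2 zones each.
example_types = ["c6g.xlarge", "m6g.xlarge", "c5.xlarge"]
example_zones = {
    "us-east-1": ["us-east-1a", "us-east-1b"],
    "us-west-2": ["us-west-2a", "us-west-2b"],
}
pools = capacity_pools(example_types, example_zones)
print(len(pools))  # 2 regions * 2 zones * 3 types = 12 pools
```

With a single instance type in a single zone there would be one pool; widening each dimension multiplies the options available when any one pool is constrained.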
This multi-region approach was implemented in three key areas of the Prime Video stack:
Signal Delivery: The live signal ingestion stack was built from the ground up to support multi-region.
Playback: The playback stack was globalized to enable regional failover and consistent user experience.
Application Storefront: The storefront stack was also migrated to a multi-region architecture, including data replication and globalization.
Elastic Scaling
The unique challenges for scaling Prime Video's infrastructure for live sports events include:
A highly variable "peak-to-mean" ratio, with NFL Thursday Night Football games causing order-of-magnitude spikes in traffic.
Spiky traffic patterns within the games, with sharp increases at kickoff and halftime.
The need for coordination across hundreds of distributed service teams to scale up and down.
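The "peak-to-mean" ratio mentioned above is simply the highest observed load divided by the average load over the same window. A minimal sketch follows; the sample values are made up for illustration and are not Prime Video traffic data.

```python
from typing import Sequence


def peak_to_mean(samples: Sequence[float]) -> float:
    """Return max(samples) divided by the arithmetic mean of samples."""
    if not samples:
        raise ValueError("samples must not be empty")
    mean = sum(samples) / len(samples)
    if mean == 0:
        raise ValueError("mean load is zero; ratio is undefined")
    return max(samples) / mean


# Hypothetical load over five intervals: flat baseline, then one spike.
load = [1, 1, 1, 1, 16]
print(peak_to_mean(load))  # mean is 4, peak is 16, so the ratio is 4.0
```

A fleet sized for the mean would be overwhelmed at the peak, and a fleet sized permanently for the peak would sit mostly idle, which is why the ratio drives the case for elastic scaling.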
To address these challenges, Prime Video built a centralized auto-scaling solution that:
Leverages a forecasting system to predict demand and optimize capacity planning.
Provides a central hub to route scaling signals and manage the auto-scaling process.
Embeds a transformation library within each team's service to enable automated, service-specific scaling.
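The three pieces above (forecast, central hub, per-service transformation) can be sketched roughly as follows. This is an illustrative toy under stated assumptions, not Prime Video's implementation: the class and function names (`ScalingSignal`, `ScalingHub`, `per_viewer_transform`), the headroom percentage, and all numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class ScalingSignal:
    """A forecast-derived signal that the hub fans out to services."""
    event: str
    forecast_viewers: int


class ScalingHub:
    """Central hub: routes one signal to every registered service transform."""

    def __init__(self) -> None:
        self._transforms: Dict[str, Callable[[ScalingSignal], int]] = {}

    def register(self, service: str, transform: Callable[[ScalingSignal], int]) -> None:
        self._transforms[service] = transform

    def publish(self, signal: ScalingSignal) -> Dict[str, int]:
        # Each service converts the shared signal into its own target size.
        return {name: fn(signal) for name, fn in self._transforms.items()}


def per_viewer_transform(
    viewers_per_instance: int, headroom_percent: int = 20, min_instances: int = 2
) -> Callable[[ScalingSignal], int]:
    """Build a service-specific transform: viewers -> instance count."""

    def transform(signal: ScalingSignal) -> int:
        numerator = signal.forecast_viewers * (100 + headroom_percent)
        denominator = 100 * viewers_per_instance
        needed = -(-numerator // denominator)  # integer ceiling division
        return max(min_instances, needed)

    return transform


hub = ScalingHub()
hub.register("playback", per_viewer_transform(viewers_per_instance=10_000))
hub.register("storefront", per_viewer_transform(viewers_per_instance=50_000))
targets = hub.publish(ScalingSignal(event="TNF kickoff", forecast_viewers=1_000_000))
print(targets)  # {'playback': 120, 'storefront': 24}
```

The design point this sketch tries to capture is that the hub knows nothing about any individual service: each team supplies its own conversion from a shared demand signal to its own capacity units, so one forecast can drive many services without manual coordination.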
Well-Architected Framework
The key design principles Prime Video applied to ensure reliability and resilience included:
Automatic recovery from failures
Testing to the point of failure and beyond to validate recovery procedures
Horizontal scaling to increase aggregate workload availability
Eliminating capacity guesswork by relying on automation
Other best practices included:
Understanding data consistency, availability, and partition tolerance trade-offs
Identifying and managing dependencies, both internal and external
Ensuring operational readiness through automated recovery procedures and cross-region health checks
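As a rough illustration of how cross-region health checks can feed automated recovery, the sketch below picks the first healthy region from an ordered preference list. The function name, region names, and health map are assumptions made for this example; the talk summary does not describe the actual mechanism.

```python
from typing import Dict, Sequence


def choose_region(preferred: Sequence[str], health: Dict[str, bool]) -> str:
    """Return the first region in `preferred` that is reported healthy.

    Regions missing from `health` are treated as unhealthy.
    """
    for region in preferred:
        if health.get(region, False):
            return region
    raise RuntimeError("no healthy region available")


# Hypothetical health-check results: the primary region is failing.
health_status = {"us-east-1": False, "us-west-2": True}
active = choose_region(["us-east-1", "us-west-2"], health_status)
print(active)  # us-west-2
```

Treating an unknown region as unhealthy is a deliberately conservative choice here: traffic is only sent where a health check has positively confirmed availability.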
Key Outcomes
Achieved the operational resilience needed to support viewership growth of millions of users during NFL Thursday Night Football games.
Enabled dynamic scaling to handle the highly variable "peak-to-mean" ratio, with a 20% reduction in operational costs.
Realized a 43% reduction in carbon footprint by optimizing the number of EC2 instances and adopting Graviton.
The overarching principle guiding Prime Video's approach was the understanding that "everything fails all the time," and building a system that embraces failure as a natural occurrence through multi-region architectures and automated scaling.