Amit Arora, Principal Solutions Architect, Amazon Web Services (AWS)
Exploring how Netflix evaluates AWS instances, focusing on AWS accelerated compute (e.g., NVIDIA GPU instances and AWS custom silicon such as Trainium)
Discussing Netflix's CI/CD platform and benchmarking framework used to validate AWS instances for their workloads
Exploring large language models (LLMs), popular inference engine frameworks, and serving techniques used for optimizing model execution and deployment
Netflix's Generative AI Initiatives
Netflix's streaming service is famous for its recommendations and personalization, driven by machine learning algorithms
The demand for accelerated computing at Netflix continues to grow due to the transformative potential of Generative AI (Gen AI)
Notable Gen AI projects at Netflix include:
Text-to-image generation
Conversational search
Real-time adaptive recommendations
Retrieval-augmented generation (RAG)-based search use cases
Evaluating AWS Instances for Gen AI Workloads
Netflix uses industry-standard benchmarks, production-load canary testing, and stress testing, all integrated into their CI/CD platform and automated test harness
They determine service accuracy, latency, and throughput requirements, then iterate over various foundation models, AWS instance types/sizes, and serving stacks to identify the best-performing, most scalable combination (a simplified sketch of this sweep follows below)
This automated approach empowers Netflix service owners to make data-driven decisions on deployment cost, capacity requirements, and pre-scaling and autoscaling targets
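As a rough, hypothetical sketch of such an iteration loop (not Netflix's actual harness), the Python below sweeps over candidate models and instance types, sends a fixed prompt to a placeholder HTTP endpoint for each combination, and records latency percentiles and request throughput; the model names, instance types, endpoint URL, and payload fields are all illustrative assumptions.

    # Hypothetical benchmarking sweep: endpoint URL, payload fields, and
    # configuration names are illustrative placeholders, not Netflix's actual harness.
    import itertools
    import statistics
    import time

    import requests  # assumes each serving stack exposes a simple HTTP inference endpoint

    MODELS = ["llama-3-8b", "mistral-7b"]              # illustrative candidate foundation models
    INSTANCE_TYPES = ["g5.12xlarge", "p4d.24xlarge"]   # illustrative candidate AWS instance types
    PROMPT = "Summarize the benefits of GPU-to-GPU interconnect in two sentences."

    def run_trial(endpoint: str, num_requests: int = 50) -> dict:
        """Send a fixed prompt repeatedly and record latency percentiles and throughput."""
        latencies = []
        start = time.perf_counter()
        for _ in range(num_requests):
            t0 = time.perf_counter()
            resp = requests.post(endpoint, json={"prompt": PROMPT, "max_tokens": 128}, timeout=60)
            resp.raise_for_status()
            latencies.append(time.perf_counter() - t0)
        elapsed = time.perf_counter() - start
        latencies.sort()
        return {
            "p50_latency_s": statistics.median(latencies),
            "p90_latency_s": latencies[int(0.9 * len(latencies)) - 1],
            "throughput_rps": num_requests / elapsed,
        }

    if __name__ == "__main__":
        for model, instance in itertools.product(MODELS, INSTANCE_TYPES):
            # In a real harness, each (model, instance) pair would map to a freshly
            # deployed serving stack; the URL below is only a placeholder.
            endpoint = f"http://benchmark-{model}-{instance.replace('.', '-')}/generate"
            print(model, instance, run_trial(endpoint))

In a real setup, the recorded results would feed directly into the capacity and autoscaling decisions described above.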
Importance of Model Parameters, Quantization, and Inference Engines
Model parameters (weights and biases) represent the neural network's learned knowledge and are crucial for ensuring accuracy in LLMs
Tokenizers and embeddings enable contextual and deeper understanding of user prompts and efficient text generation
Optimizing compute and memory usage through quantization techniques (e.g., mixed precision, low precision) is important for LLM inference (see the toy quantization example after this list)
Inference engines like TensorRT and TVM are used to optimize model execution and hardware utilization
Inference servers act as an intermediary between user requests and the inference engine, managing GPU resources and providing features like tensor parallelism, concurrent batching, and LLM-specific optimizations
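As a toy illustration of the quantization idea mentioned above, the sketch below applies per-tensor symmetric int8 quantization to a random weight matrix with NumPy; it is a minimal example of the concept, not the exact scheme used by TensorRT, TVM, or any production inference engine.

    # Minimal illustration of post-training weight quantization (symmetric int8).
    import numpy as np

    rng = np.random.default_rng(0)
    weights_fp32 = rng.normal(0.0, 0.02, size=(4096, 4096)).astype(np.float32)

    # Per-tensor symmetric quantization: map [-max|w|, +max|w|] onto the int8 range.
    scale = np.abs(weights_fp32).max() / 127.0
    weights_int8 = np.clip(np.round(weights_fp32 / scale), -127, 127).astype(np.int8)

    # Dequantize to measure the accuracy cost of the lower precision.
    dequantized = weights_int8.astype(np.float32) * scale
    max_abs_error = np.abs(weights_fp32 - dequantized).max()

    print(f"fp32 size: {weights_fp32.nbytes / 1e6:.1f} MB")   # ~67 MB
    print(f"int8 size: {weights_int8.nbytes / 1e6:.1f} MB")   # ~17 MB (4x smaller)
    print(f"max abs round-trip error: {max_abs_error:.6f}")

The 4x memory reduction is what makes larger models fit on a given GPU, at the cost of the small round-trip error shown.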
AWS Foundation Model Benchmarking (FM Bench) Tool
FM Bench is an open-source package for benchmarking any foundation model on any AWS generative AI service
It is model-agnostic and AWS service-agnostic, allowing you to benchmark various models on different platforms (EC2, SageMaker, Bedrock)
FM Bench provides a unified configuration file for testing different combinations of instance types, inference engines, and serving stacks
It generates detailed reports with metrics like inference latency, time to first/last token, transaction throughput, and token throughput, enabling data-driven decisions
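As a hypothetical illustration of how per-request metrics of this kind (time to first token, time to last token, token throughput) can be derived from a streaming inference response, the sketch below times a simulated token stream; the stream_tokens generator is a stand-in, not FM Bench's actual implementation.

    # Hypothetical per-request metric collection from a streaming response.
    import time
    from typing import Iterator

    def stream_tokens() -> Iterator[str]:
        """Simulated streaming endpoint: yields tokens with some delay."""
        for token in ["Stranger", " Things", " is", " a", " sci-fi", " series", "."]:
            time.sleep(0.05)  # stand-in for network + generation latency
            yield token

    def measure_request(token_stream: Iterator[str]) -> dict:
        start = time.perf_counter()
        first_token_at = None
        num_tokens = 0
        for _ in token_stream:
            if first_token_at is None:
                first_token_at = time.perf_counter()
            num_tokens += 1
        end = time.perf_counter()
        return {
            "time_to_first_token_s": first_token_at - start,
            "time_to_last_token_s": end - start,
            "token_throughput_tps": num_tokens / (end - start),
        }

    print(measure_request(stream_tokens()))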
Example FM Bench Report and Insights
The report compares the price-performance of different instance types (e.g., P4d, G6e) for a given workload and latency budget
In the report's bubble chart, larger bubbles indicate higher throughput; the performance delta between instance types is driven by factors such as high-bandwidth memory (HBM) and the GPU-to-GPU NVLink interconnect
The report also includes charts to help determine the right serving stack (instance type and count) to meet specific throughput requirements
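As an illustrative sketch of the decision such a report supports, the code below picks the cheapest serving stack (instance type and count) that satisfies a latency budget and a target aggregate throughput; every number in it is a made-up placeholder rather than a real FM Bench result or current AWS price.

    # Illustrative serving-stack selection. All latencies, throughputs, and
    # prices below are made-up placeholders, not real benchmark results.
    import math

    LATENCY_BUDGET_S = 2.0          # p90 latency budget per request
    TARGET_THROUGHPUT_RPS = 40.0    # required aggregate requests per second

    # (instance_type, p90_latency_s, throughput_rps_per_instance, hourly_price_usd)
    candidates = [
        ("g5.12xlarge",  1.8,  6.0,  5.7),
        ("g6e.12xlarge", 1.5,  8.0,  7.6),
        ("p4d.24xlarge", 0.9, 25.0, 32.8),
    ]

    best = None
    for name, p90, rps, price in candidates:
        if p90 > LATENCY_BUDGET_S:
            continue  # fails the latency budget outright
        count = math.ceil(TARGET_THROUGHPUT_RPS / rps)  # instances needed for target throughput
        hourly_cost = count * price
        if best is None or hourly_cost < best[2]:
            best = (name, count, hourly_cost)

    if best:
        name, count, hourly_cost = best
        print(f"Cheapest stack meeting requirements: {count}x {name} at ~${hourly_cost:.2f}/hour")
    else:
        print("No candidate meets the latency budget")

The same calculation extends naturally to autoscaling targets by re-running it against projected peak throughput.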
Getting Started with FM Bench
The FM Bench Orchestrator repository provides a companion tool to simplify the benchmarking process
The FM Bench website has detailed instructions on installation, configuration, and interpreting the reports
FM Bench is an open-source project, and users are encouraged to create issues on GitHub or reach out on LinkedIn for any requests or feedback