Amit Arora, Principal Solutions Architect, Amazon Web Services (AWS)
Exploring how Netflix evaluates AWS instances, focusing on AWS accelerated compute (e.g., NVIDIA GPU instances and AWS custom silicon such as Trainium)
Discussing Netflix's CI/CD platform and benchmarking framework used to validate AWS instances for their workloads
Exploring large language models (LLMs), popular inference engine frameworks, and serving techniques used for optimizing model execution and deployment
Netflix's Generative AI Initiatives
Netflix's streaming service is famous for its recommendations and personalization, driven by machine learning algorithms
The demand for accelerated computing at Netflix continues to grow due to the transformative potential of Generative AI (Gen AI)
Notable Gen AI projects at Netflix include:
Text-to-image generation
Conversational search
Real-time adaptive recommendations
Retrieval-augmented generation (RAG)-based search use cases
Evaluating AWS Instances for Gen AI Workloads
Netflix uses industry-standard benchmarks, production-load canary testing, and stress testing, all integrated into their CI/CD platform and automated test harness
They determine service accuracy, latency, and throughput requirements, then iterate over various foundation models, AWS instance types/sizes, and serving stacks to identify the best-performing, most scalable combination (a simplified sketch of this sweep follows below)
This automated approach empowers Netflix service owners to make data-driven decisions on deployment cost, capacity requirements, and pre-scaling and autoscaling targets
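As a rough, hypothetical sketch of such an iteration loop (not Netflix's actual harness), the Python below sweeps over candidate models and instance types, sends a fixed prompt to a placeholder HTTP endpoint for each combination, and records latency percentiles and request throughput; the model names, instance types, endpoint URL, and payload fields are all illustrative assumptions.

    # Hypothetical benchmarking sweep: endpoint URL, payload fields, and
    # configuration names are illustrative placeholders, not Netflix's actual harness.
    import itertools
    import statistics
    import time

    import requests  # assumes each serving stack exposes a simple HTTP inference endpoint

    MODELS = ["llama-3-8b", "mistral-7b"]              # illustrative candidate foundation models
    INSTANCE_TYPES = ["g5.12xlarge", "p4d.24xlarge"]   # illustrative candidate AWS instance types
    PROMPT = "Summarize the benefits of GPU-to-GPU interconnect in two sentences."

    def run_trial(endpoint: str, num_requests: int = 50) -> dict:
        """Send a fixed prompt repeatedly and record latency percentiles and throughput."""
        latencies = []
        start = time.perf_counter()
        for _ in range(num_requests):
            t0 = time.perf_counter()
            resp = requests.post(endpoint, json={"prompt": PROMPT, "max_tokens": 128}, timeout=60)
            resp.raise_for_status()
            latencies.append(time.perf_counter() - t0)
        elapsed = time.perf_counter() - start
        latencies.sort()
        return {
            "p50_latency_s": statistics.median(latencies),
            "p90_latency_s": latencies[int(0.9 * len(latencies)) - 1],
            "throughput_rps": num_requests / elapsed,
        }

    if __name__ == "__main__":
        for model, instance in itertools.product(MODELS, INSTANCE_TYPES):
            # In a real harness, each (model, instance) pair would map to a freshly
            # deployed serving stack; the URL below is only a placeholder.
            endpoint = f"http://benchmark-{model}-{instance.replace('.', '-')}/generate"
            print(model, instance, run_trial(endpoint))

In a real setup, the recorded results would feed directly into the capacity and autoscaling decisions described above.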
Importance of Model Parameters, Quantization, and Inference Engines
Model parameters (weights and biases) represent the neural network's learned knowledge and are crucial for ensuring accuracy in LLMs
Tokenizers and embeddings enable contextual and deeper understanding of user prompts and efficient text generation
Optimizing compute and memory usage through quantization techniques (e.g., mixed precision, low precision) is important for LLM inference (see the toy quantization example after this list)
Inference engines like TensorRT and TVM are used to optimize model execution and hardware utilization
Inference servers act as an intermediary between user requests and the inference engine, managing GPU resources and providing features like tensor parallelism, concurrent batching, and LLM-specific optimizations
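As a toy illustration of the quantization idea mentioned above, the sketch below applies per-tensor symmetric int8 quantization to a random weight matrix with NumPy; it is a minimal example of the concept, not the exact scheme used by TensorRT, TVM, or any production inference engine.

    # Minimal illustration of post-training weight quantization (symmetric int8).
    import numpy as np

    rng = np.random.default_rng(0)
    weights_fp32 = rng.normal(0.0, 0.02, size=(4096, 4096)).astype(np.float32)

    # Per-tensor symmetric quantization: map [-max|w|, +max|w|] onto the int8 range.
    scale = np.abs(weights_fp32).max() / 127.0
    weights_int8 = np.clip(np.round(weights_fp32 / scale), -127, 127).astype(np.int8)

    # Dequantize to measure the accuracy cost of the lower precision.
    dequantized = weights_int8.astype(np.float32) * scale
    max_abs_error = np.abs(weights_fp32 - dequantized).max()

    print(f"fp32 size: {weights_fp32.nbytes / 1e6:.1f} MB")   # ~67 MB
    print(f"int8 size: {weights_int8.nbytes / 1e6:.1f} MB")   # ~17 MB (4x smaller)
    print(f"max abs round-trip error: {max_abs_error:.6f}")

The 4x memory reduction is what makes larger models fit on a given GPU, at the cost of the small round-trip error shown.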
AWS Foundation Model Benchmarking (FM Bench) Tool
FM Bench is an open-source package for benchmarking any foundation model on any AWS generative AI service
It is model-agnostic and AWS service-agnostic, allowing you to benchmark various models on different platforms (EC2, SageMaker, Bedrock)
FM Bench provides a unified configuration file for testing different combinations of instance types, inference engines, and serving stacks
It generates detailed reports with metrics like inference latency, time to first/last token, transaction throughput, and token throughput, enabling data-driven decisions
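As a hypothetical illustration of how per-request metrics of this kind (time to first token, time to last token, token throughput) can be derived from a streaming inference response, the sketch below times a simulated token stream; the stream_tokens generator is a stand-in, not FM Bench's actual implementation.

    # Hypothetical per-request metric collection from a streaming response.
    import time
    from typing import Iterator

    def stream_tokens() -> Iterator[str]:
        """Simulated streaming endpoint: yields tokens with some delay."""
        for token in ["Stranger", " Things", " is", " a", " sci-fi", " series", "."]:
            time.sleep(0.05)  # stand-in for network + generation latency
            yield token

    def measure_request(token_stream: Iterator[str]) -> dict:
        start = time.perf_counter()
        first_token_at = None
        num_tokens = 0
        for _ in token_stream:
            if first_token_at is None:
                first_token_at = time.perf_counter()
            num_tokens += 1
        end = time.perf_counter()
        return {
            "time_to_first_token_s": first_token_at - start,
            "time_to_last_token_s": end - start,
            "token_throughput_tps": num_tokens / (end - start),
        }

    print(measure_request(stream_tokens()))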
Example FM Bench Report and Insights
The report compares the price-performance of different instance types (e.g., P4d, G6e) for a given workload and latency budget
In the report's bubble chart, larger bubbles indicate higher throughput; the performance delta between instance types is driven by factors such as high-bandwidth memory (HBM) and the GPU-to-GPU NVLink interconnect
The report also includes charts to help determine the right serving stack (instance type and count) to meet specific throughput requirements
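As an illustrative sketch of the decision such a report supports, the code below picks the cheapest serving stack (instance type and count) that satisfies a latency budget and a target aggregate throughput; every number in it is a made-up placeholder rather than a real FM Bench result or current AWS price.

    # Illustrative serving-stack selection. All latencies, throughputs, and
    # prices below are made-up placeholders, not real benchmark results.
    import math

    LATENCY_BUDGET_S = 2.0          # p90 latency budget per request
    TARGET_THROUGHPUT_RPS = 40.0    # required aggregate requests per second

    # (instance_type, p90_latency_s, throughput_rps_per_instance, hourly_price_usd)
    candidates = [
        ("g5.12xlarge",  1.8,  6.0,  5.7),
        ("g6e.12xlarge", 1.5,  8.0,  7.6),
        ("p4d.24xlarge", 0.9, 25.0, 32.8),
    ]

    best = None
    for name, p90, rps, price in candidates:
        if p90 > LATENCY_BUDGET_S:
            continue  # fails the latency budget outright
        count = math.ceil(TARGET_THROUGHPUT_RPS / rps)  # instances needed for target throughput
        hourly_cost = count * price
        if best is None or hourly_cost < best[2]:
            best = (name, count, hourly_cost)

    if best:
        name, count, hourly_cost = best
        print(f"Cheapest stack meeting requirements: {count}x {name} at ~${hourly_cost:.2f}/hour")
    else:
        print("No candidate meets the latency budget")

The same calculation extends naturally to autoscaling targets by re-running it against projected peak throughput.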
Getting Started with FM Bench
The FM Bench Orchestrator repository provides a companion tool to simplify the benchmarking process
The FM Bench website has detailed instructions on installation, configuration, and interpreting the reports
FM Bench is an open-source project, and users are encouraged to create issues on GitHub or reach out on LinkedIn for any requests or feedback