How Netflix benchmarks FMs and LLMs across hardware chipsets (NFX307)

Netflix's Evaluation of AWS Instances for Generative AI Workloads

Introduction

  • Amirali Roudaki, Performance Engineer, Cloud Infrastructure Engineering, Netflix
  • Amit Arora, Principal Solutions Architect, Amazon Web Services (AWS)
  • Exploring how Netflix evaluates AWS instances, focusing on AWS accelerated compute (e.g., NVIDIA GPU instances and AWS custom silicon instances)
  • Discussing Netflix's CI/CD platform and benchmarking framework used to validate AWS instances for their workloads
  • Exploring large language models (LLMs), popular inference engine frameworks, and serving techniques used for optimizing model execution and deployment

Netflix's Generative AI Initiatives

  • Netflix's streaming service is known for its recommendations and personalization, driven by machine learning algorithms
  • The demand for accelerated computing at Netflix continues to grow due to the transformative potential of Generative AI (Gen AI)
  • Notable Gen AI projects at Netflix include:
    • Text-to-image generation
    • Conversational search
    • Real-time adaptive recommendations
    • ReAct-based (reason-and-act) search use cases

Evaluating AWS Instances for Gen AI Workloads

  • Netflix uses industry-standard benchmarks, production-load canaries, and stress testing, all integrated into their CI/CD platform and automated test harness
  • They determine each service's accuracy, latency, and throughput requirements, then iterate over combinations of foundation models, AWS instance sizes/types, and serving stacks to identify the best performance and scalability (a sketch of such a sweep follows this list)
  • This automated approach empowers Netflix service owners to make data-driven decisions on deployment cost, capacity requirements, and optimizing pre-scaling and autoscaling targets
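
As a rough illustration of that iteration loop, the sketch below sweeps a grid of models, instance types, and serving stacks, then keeps whichever configuration meets a latency budget with the highest throughput. All names and numbers, and the run_benchmark stub, are hypothetical placeholders; Netflix's actual harness is not public.

```python
import itertools
import random
from dataclasses import dataclass

@dataclass
class BenchmarkResult:
    model: str
    instance_type: str
    stack: str
    p99_latency_ms: float
    tokens_per_sec: float

def run_benchmark(model: str, instance_type: str, stack: str) -> BenchmarkResult:
    # In a real harness this step would deploy the serving stack, replay a
    # production-like (canary) workload, and collect metrics; the synthetic
    # numbers here are purely for illustration.
    return BenchmarkResult(model, instance_type, stack,
                           p99_latency_ms=random.uniform(100, 900),
                           tokens_per_sec=random.uniform(200, 2000))

MODELS = ["llama-3-8b", "mistral-7b"]              # hypothetical candidates
INSTANCE_TYPES = ["g5.12xlarge", "p4d.24xlarge"]   # hypothetical shortlist
STACKS = ["vllm", "triton+tensorrt-llm"]           # hypothetical serving stacks
LATENCY_BUDGET_MS = 500                            # service-level requirement

results = [run_benchmark(m, i, s)
           for m, i, s in itertools.product(MODELS, INSTANCE_TYPES, STACKS)]

# Keep only configurations that meet the latency budget, then rank by throughput.
viable = [r for r in results if r.p99_latency_ms <= LATENCY_BUDGET_MS]
if viable:
    best = max(viable, key=lambda r: r.tokens_per_sec)
    print(f"Best: {best.model} on {best.instance_type} via {best.stack}")
else:
    print("No configuration met the latency budget")
```

Ranking within a hard latency budget, rather than by raw throughput alone, is what lets the results feed directly into the cost and autoscaling decisions described above.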

Importance of Model Parameters, Quantization, and Inference Engines

  • Model parameters (weights, activations) represent the neural network's learned knowledge and are crucial for ensuring accuracy in LLMs
  • Tokenizers and embeddings enable contextual and deeper understanding of user prompts and efficient text generation
  • Optimizing compute and memory usage through quantization techniques (e.g., mixed precision, low precision) is important for LLM inference (a memory-footprint sketch follows this list)
  • Inference engines like TensorRT and TVM are used to optimize model execution and hardware utilization
  • Inference servers act as an intermediary between user requests and the inference engine, managing GPU resources and providing features such as tensor parallelism, continuous batching, and other LLM-specific optimizations
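
To make the quantization point concrete, here is a back-of-the-envelope sketch of weight-memory footprint at different precisions (the parameter counts are illustrative, not figures from the talk):

```python
# Rough weight-memory footprint of an LLM at different precisions. Real
# deployments also need memory for the KV cache, activations, and framework
# overhead, so treat these numbers as lower bounds.

BYTES_PER_PARAM = {"fp32": 4.0, "fp16/bf16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(num_params: float, precision: str) -> float:
    return num_params * BYTES_PER_PARAM[precision] / 1e9

for params in (8e9, 70e9):  # e.g., 8B- and 70B-parameter models
    for precision in BYTES_PER_PARAM:
        print(f"{params / 1e9:.0f}B @ {precision:>9}: "
              f"{weight_memory_gb(params, precision):6.1f} GB")
```

At fp16, a 70B-parameter model needs roughly 140 GB for weights alone, more than a single GPU's HBM, which is why low-precision quantization and tensor parallelism across GPUs matter so much for serving.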

AWS Foundation Model Benchmarking (FM Bench) Tool

  • FM Bench is an open-source package for benchmarking any foundation model on any AWS generative AI service
  • It is model-agnostic and AWS service-agnostic, allowing you to benchmark various models on different platforms (EC2, SageMaker, Bedrock)
  • FM Bench provides a unified configuration file for testing different combinations of instance types, inference engines, and serving stacks
  • It generates detailed reports with metrics such as inference latency, time to first/last token, transaction throughput, and token throughput, enabling data-driven decisions (a sketch of how such per-request metrics can be derived follows)
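
The sketch below shows one way these per-request metrics can be computed from a streaming token iterator; it mirrors the metric definitions above but is not FM Bench's internal implementation.

```python
import time

def measure_streaming_request(stream):
    """Derive latency and throughput metrics from a token iterator.

    `stream` is any iterable that yields tokens as the model generates
    them (e.g., a streaming inference client).
    """
    start = time.perf_counter()
    first_token_at = None
    tokens = 0
    for _ in stream:
        tokens += 1
        if first_token_at is None:
            first_token_at = time.perf_counter()
    end = time.perf_counter()
    total = end - start
    return {
        "time_to_first_token_s": (first_token_at or end) - start,
        "time_to_last_token_s": total,
        "token_throughput_tok_per_s": tokens / total if total > 0 else 0.0,
    }

# Illustrative usage with a fake model stream emitting 50 tokens ~10 ms apart.
def fake_stream():
    for _ in range(50):
        time.sleep(0.01)
        yield "tok"

print(measure_streaming_request(fake_stream()))
```

Aggregating these per-request numbers across concurrency levels yields the latency percentiles and transaction/token throughput figures that appear in the reports.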

Example FM Bench Report and Insights

  • The report compares the price-performance of different instance types (e.g., p4d, g6e) for a given workload and latency budget
  • Larger bubbles indicate higher throughput; the performance delta comes from hardware factors such as HBM capacity and bandwidth and the presence of GPU-to-GPU NVLink interconnect
  • The report also includes charts that help determine the right serving stack (instance type and count) to meet a given throughput requirement (a cost-and-capacity sketch follows this list)
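
A simplified version of that price-performance math, with placeholder hourly prices and throughputs rather than numbers from the report; substitute your own FM Bench results and current AWS pricing:

```python
import math

# Hypothetical inputs: on-demand hourly price and measured aggregate
# token throughput per instance for a fixed workload and latency budget.
instances = {
    # instance_type: (usd_per_hour, tokens_per_sec)
    "p4d.24xlarge": (32.77, 2400.0),
    "g6e.12xlarge": (10.49, 900.0),
}

TARGET_TOK_PER_SEC = 5000  # aggregate service throughput requirement

for name, (usd_per_hour, tok_per_sec) in instances.items():
    usd_per_million_tokens = usd_per_hour / (tok_per_sec * 3600) * 1e6
    count = math.ceil(TARGET_TOK_PER_SEC / tok_per_sec)
    print(f"{name}: ${usd_per_million_tokens:.2f} per 1M tokens, "
          f"{count} instance(s) to reach {TARGET_TOK_PER_SEC} tok/s")
```

This is the same cost-per-token and fleet-sizing reasoning the report charts support: a pricier instance can still win on price-performance if its throughput advantage is large enough within the latency budget.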

Getting Started with FM Bench

  • The FM Bench Orchestrator repository provides a companion tool to simplify the benchmarking process
  • The FM Bench website has detailed instructions on installation, configuration, and interpreting the reports
  • FM Bench is an open-source project, and users are encouraged to create issues on GitHub or reach out on LinkedIn for any requests or feedback
