AWS re:Invent 2025 - High-performance inference for frontier AI models (AIM226)

High-Performance Inference for Frontier AI Models

Overview

  • Presented by Philip from Baseten, an inference provider on the AWS Marketplace
  • Covers the concept of inference engineering, the rise of open-source foundation models, the components of an inference stack, and real-world production use cases

The Rise of Open-Source Frontier AI Models

  • There are now over 2 million open-source models available on platforms like Hugging Face
  • These models are reaching frontier-level quality across various domains, including language, speech, image, and video
  • Examples include the Kimi K2 language model and DeepSpeech for automatic speech recognition

Principles of Inference Engineering

  1. Optimization requires constraints - Need to define specific performance goals to optimize for
  2. Scale unlocks more performance techniques - Leveraging large-scale parallelism, disaggregation, etc. requires sufficient traffic volume
  3. Stay dynamic - The system should adapt in real time to changing traffic patterns

The Inference Stack

  1. Runtime Performance Optimization:

    • Quantization - Reducing precision from 16-bit to 8-bit or 4-bit to exploit lower-precision tensor cores and reduce memory-bandwidth pressure
    • Selective quantization - Quantizing only certain model components (e.g., weights but not activations, or skipping quality-sensitive layers) to preserve output quality
    • Speculative decoding - Using algorithms like EAGLE-3 to generate draft tokens that the target model verifies in parallel, increasing tokens per second
    • Caching - Leveraging KV-cache-aware routing so requests with shared prefixes land on workers that already hold their cache, achieving 2x faster end-to-end performance for use cases like code completion
    • Parallelism - Balancing techniques like tensor and expert parallelism to trade off latency against throughput
    • Disaggregation - Separating prefill and decode onto independent workers so each can be provisioned and tuned separately
  2. Scalable Infrastructure:

    • Autoscaling - Dynamically provisioning GPU capacity to match fluctuating traffic patterns
    • Multicluster capacity management - Leveraging compute resources across multiple regions and clusters to provide active-active reliability and global proximity
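To make the quantization item above concrete, here is a minimal sketch of symmetric per-output-channel int8 weight quantization in NumPy. The function names and the 127-point symmetric scale are illustrative assumptions, not code from the talk or any specific inference engine:

```python
import numpy as np

def quantize_int8(weights: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Symmetric per-output-channel int8 quantization of a weight matrix."""
    # One scale per output row, so an outlier in one channel does not
    # inflate quantization error across the whole matrix.
    scales = np.abs(weights).max(axis=1, keepdims=True) / 127.0
    scales = np.where(scales == 0, 1.0, scales)  # avoid divide-by-zero on all-zero rows
    q = np.clip(np.round(weights / scales), -127, 127).astype(np.int8)
    return q, scales

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Recover an approximate float32 matrix from int8 values and scales."""
    return q.astype(np.float32) * scales

# Small demonstration on random weights.
rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8)).astype(np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
err = float(np.abs(w - w_hat).max())  # worst-case element error, at most scale/2
```

The per-channel scales are what "selective" schemes refine further: components where this rounding error hurts quality (e.g., certain activations or layers) can simply be left in higher precision.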

Real-World Use Cases

  1. Open Evidence: A healthcare startup serving billions of custom and fine-tuned language model calls per week to healthcare providers, enabled by Baseten's high-performance and reliable inference stack.
  2. Zed: An IDE provider that achieved 2x faster end-to-end code completion, 45% lower P90 latency, and 3.5x higher throughput using Baseten's inference stack with KV cache optimization.
  3. Latent: A pharmaceutical search company that leveraged Baseten's multicluster strategy and autoscaling capabilities to implement highly reliable inference.
  4. Superhuman: An email app (acquired by Grammarly) that used Baseten's embedding inference to cut P95 latency by 80% across fine-tuned embedding models powering key app features.

Key Takeaways

  • Inference engineering is a critical discipline for deploying frontier AI models in production
  • Optimizing both runtime performance and scalable infrastructure is essential for delivering high-performance, reliable, and cost-effective inference
  • Techniques like quantization, speculative decoding, caching, and parallelism can significantly improve runtime performance
  • Autoscaling and multicluster capacity management are crucial for handling fluctuating traffic and ensuring global availability
  • The inference stack can be adapted to support a wide range of AI model modalities, including language, speech, image, and video
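The autoscaling takeaway reduces to a capacity calculation: provision enough replicas to cover observed demand plus headroom for spikes. This sketch assumes a tokens-per-second capacity model and a headroom factor; the function and its defaults are hypothetical, not the actual policy described in the talk:

```python
import math

def replicas_needed(observed_tps: float, per_replica_tps: float,
                    min_replicas: int = 1, headroom: float = 1.2) -> int:
    """Compute the GPU replica count needed to serve observed tokens/sec.

    headroom keeps spare capacity so traffic spikes can be absorbed
    while newly provisioned replicas are still booting.
    """
    if observed_tps <= 0:
        return min_replicas
    return max(min_replicas, math.ceil(observed_tps * headroom / per_replica_tps))
```

For example, 10,000 tokens/sec of traffic against replicas that each sustain 1,500 tokens/sec yields `replicas_needed(10_000, 1_500) == 8` (12,000 tokens/sec of provisioned capacity after headroom). A multicluster setup would run this per cluster, spilling demand to other regions when one cluster hits its capacity ceiling.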
