AWS re:Invent 2025 - High-performance inference for frontier AI models (AIM226)
High-Performance Inference for Frontier AI Models
Overview
Presented by Philip from Baseten, an inference provider on the AWS Marketplace
Covers the concept of inference engineering, the rise of open-source foundation models, the components of an inference stack, and real-world production use cases
The Rise of Open-Source Frontier AI Models
There are now over 2 million open-source models available on platforms like Hugging Face
These models are reaching frontier-level quality across various domains, including language, speech, image, and video
Examples include the Kimi K2 language model and DeepSpeech for automatic speech recognition
Principles of Inference Engineering
Optimization requires constraints - Specific performance targets (e.g., latency, throughput, or cost) must be defined before the system can be optimized
Scale unlocks more performance techniques - Leveraging large-scale parallelism, disaggregation, etc. requires sufficient traffic volume
Stay dynamic - The system should update itself in real time to adapt to changing traffic patterns
The Inference Stack
Runtime Performance Optimization:
Quantization - Reducing precision from 16-bit to 8-bit or 4-bit to better exploit tensor cores and cut memory bandwidth pressure (see the quantization sketch after this list)
Selective quantization - Quantizing only certain model components (e.g., weights but not activations, or skipping quality-sensitive layers) to preserve output quality
Speculative decoding - Using algorithms like EAGLE-3 to generate draft tokens that the target model verifies, increasing tokens per second (sketched below)
Caching - Leveraging KV-cache-aware routing to achieve 2x faster end-to-end performance for use cases like code completion (sketched below)
Parallelism - Balancing techniques like tensor and expert parallelism to optimize for latency and throughput (a tensor-parallel sketch follows this list)
Disaggregation - Separating prefill and decode onto independent workers so each stage can be specialized
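To make the quantization idea concrete, here is a minimal weight-only INT8 sketch in NumPy. Production engines run FP8 or INT4 kernels directly on tensor cores; this toy only illustrates the precision-for-bandwidth trade and the per-channel scales that selective quantization leans on.

```python
# Minimal weight-only INT8 quantization sketch (NumPy). Real inference engines
# use fused FP8/INT4 kernels on tensor cores; this only illustrates the idea
# of trading precision for memory footprint and bandwidth.
import numpy as np

def quantize_weights_int8(w: np.ndarray):
    """Per-output-channel symmetric INT8 quantization of a weight matrix."""
    # One scale per output channel keeps quality closer to the FP16 baseline.
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def int8_linear(x: np.ndarray, q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Dequantize on the fly: y = x @ (q * scale)^T."""
    return x @ (q.astype(np.float32) * scale).T

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 512)).astype(np.float32)   # [out, in]
x = rng.standard_normal((4, 512)).astype(np.float32)     # small batch of activations

q, scale = quantize_weights_int8(w)
ref = x @ w.T
approx = int8_linear(x, q, scale)
print("max abs error:", np.abs(ref - approx).max())
# Weights now occupy 1 byte each instead of 2 (FP16) or 4 (FP32).
print("weight bytes:", q.nbytes, "vs fp32:", w.nbytes)
```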
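The speculative decoding bullet can be illustrated with a greedy draft-and-verify loop. EAGLE-3 itself uses a trained draft head and tree-structured verification, which this sketch does not implement; the toy "models" below are placeholders that only show how verified draft tokens let the target model emit several tokens per step.

```python
# Greedy speculative decoding sketch with toy "models". The draft and target
# functions are hypothetical stand-ins, not a real model API.
import random

def draft_model(prefix, k):
    # Hypothetical cheap draft model: proposes k next tokens.
    random.seed(sum(prefix) + len(prefix))
    return [random.randrange(100) for _ in range(k)]

def target_model(prefix):
    # Hypothetical expensive target model: greedy next token for a prefix.
    random.seed(sum(prefix) * 31 + len(prefix))
    return random.randrange(100)

def speculative_decode(prompt, max_new_tokens=32, k=4):
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new_tokens:
        draft = draft_model(tokens, k)
        # In a real engine the target scores all k draft positions in ONE
        # batched forward pass; here we compare greedy choices position by position.
        accepted = []
        for i in range(k):
            target_tok = target_model(tokens + accepted)
            if draft[i] == target_tok:
                accepted.append(target_tok)   # draft guess verified, keep going
            else:
                accepted.append(target_tok)   # take the target's token and stop
                break
        tokens.extend(accepted)
    return tokens

print(speculative_decode([1, 2, 3], max_new_tokens=12, k=4))
```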
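A hedged sketch of KV-cache-aware routing: send each request to the replica that already holds the longest matching prompt prefix in its KV cache, so prefill work is reused. The block size, replica names, and scoring below are illustrative assumptions, not Baseten's actual router.

```python
# KV-cache-aware routing sketch: prompts are hashed in fixed-size blocks, and
# each request goes to the replica with the longest contiguous cached prefix.

BLOCK = 16  # tokens per cache block (illustrative granularity)

def prefix_blocks(tokens):
    """Hash the prompt prefix in fixed-size blocks."""
    return [hash(tuple(tokens[: (i + 1) * BLOCK])) for i in range(len(tokens) // BLOCK)]

class CacheAwareRouter:
    def __init__(self, replicas):
        self.cached = {r: set() for r in replicas}   # blocks resident per replica

    def route(self, tokens):
        blocks = prefix_blocks(tokens)

        def score(replica):
            hits = 0
            for b in blocks:                 # count the contiguous cached prefix
                if b in self.cached[replica]:
                    hits += 1
                else:
                    break
            return hits

        best = max(self.cached, key=score)
        reused = score(best)
        self.cached[best].update(blocks)     # after serving, these blocks are warm
        return best, reused

router = CacheAwareRouter(["replica-a", "replica-b"])
doc = list(range(200))                        # toy prompt tokens
print(router.route(doc))                      # cold: 0 blocks reused
print(router.route(doc + [7, 8, 9]))          # warm: long shared prefix reused
```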
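For the parallelism bullet, this NumPy toy shows how a column-parallel linear layer decomposes across simulated devices. Real tensor parallelism shards the weights across GPUs and gathers partial results over the interconnect; the single-process simulation below only demonstrates that the math splits cleanly.

```python
# Tensor-parallel sketch in NumPy: a column-parallel linear layer split across
# N simulated "devices". This is a single-process toy, not a distributed setup.
import numpy as np

def column_parallel_linear(x, w, n_devices):
    # Each device holds a vertical slice of the weight matrix [in, out].
    shards = np.array_split(w, n_devices, axis=1)
    partial_outputs = [x @ shard for shard in shards]   # computed in parallel on each device
    return np.concatenate(partial_outputs, axis=1)      # gather of the partial results

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 1024))
w = rng.standard_normal((1024, 4096))

full = x @ w
sharded = column_parallel_linear(x, w, n_devices=8)
print("shards match full matmul:", np.allclose(full, sharded))
```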
Scalable Infrastructure:
Autoscaling - Dynamically provisioning GPU capacity to match fluctuating traffic patterns (a sketch follows this list)
Multicluster capacity management - Leveraging compute resources across multiple regions and clusters to provide active-active reliability and global proximity
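A minimal sketch of the autoscaling idea: derive a GPU replica target from observed load and measured per-replica throughput, bounded by floors and ceilings with some headroom. The class, field names, and numbers are illustrative assumptions, not a specific autoscaler's configuration.

```python
# Autoscaling sketch: map observed request load to a desired replica count.
from dataclasses import dataclass
import math

@dataclass
class AutoscalePolicy:
    tokens_per_sec_per_replica: float = 4000.0   # measured replica throughput (illustrative)
    target_utilization: float = 0.7              # headroom for traffic spikes
    min_replicas: int = 1
    max_replicas: int = 64

    def desired_replicas(self, observed_tokens_per_sec: float) -> int:
        needed = observed_tokens_per_sec / (
            self.tokens_per_sec_per_replica * self.target_utilization
        )
        return max(self.min_replicas, min(self.max_replicas, math.ceil(needed)))

policy = AutoscalePolicy()
for load in (500, 20_000, 300_000):              # tokens/sec at different times of day
    print(load, "->", policy.desired_replicas(load), "replicas")
```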
Real-World Use Cases
OpenEvidence: A healthcare startup serving billions of custom and fine-tuned language model calls per week to healthcare providers, enabled by Baseten's high-performance, reliable inference stack.
Zed: An IDE provider that achieved 2x faster end-to-end code completion, 45% lower P90 latency, and 3.5x higher throughput using Baseten's inference stack with KV cache optimization.
Latent: A pharmaceutical search company that leveraged Baseten's multicluster strategy and autoscaling capabilities to implement highly reliable inference.
Superhuman: An email app (acquired by Grammarly) that used Baseten's embedding inference to cut P95 latency by 80% across fine-tuned embedding models powering key app features.
Key Takeaways
Inference engineering is a critical discipline for deploying frontier AI models in production
Optimizing both runtime performance and scalable infrastructure is essential for delivering high-performance, reliable, and cost-effective inference
Techniques like quantization, speculative decoding, caching, and parallelism can significantly improve runtime performance
Autoscaling and multicluster capacity management are crucial for handling fluctuating traffic and ensuring global availability
The inference stack can be adapted to support a wide range of AI model modalities, including language, speech, image, and video