AWS re:Invent 2025 - vLLM on AWS: testing to production and everything in between (OPN414)
Deploying Large Language Models (LLMs) on AWS: A Comprehensive Journey
Foundations: Establishing Visibility and Control
Challenges with ad-hoc LLM usage across the organization:
Lack of centralized cost visibility
Difficulty enforcing policies (e.g., content filtering)
Compliance concerns
Solution: Implement an AI gateway to:
Track costs by user, team, or project
Enforce policies (e.g., content filtering for PII)
Audit access and control API keys
Examples of AI gateway frameworks: Envoy AI Gateway, LiteLLM, OpenRouter
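The core gateway features above (cost tracking by team, key-scoped access) can be sketched as a small in-memory cost tracker. This is an illustrative sketch of the idea, not the API of any specific gateway; the class name and price table are assumptions.

```python
from collections import defaultdict

class CostTracker:
    """Illustrative per-team cost attribution, as an AI gateway might do it.
    The price table below is a made-up assumption, not real pricing."""

    PRICE_PER_1K_TOKENS = {"small-model": 0.0002, "frontier-model": 0.01}

    def __init__(self):
        self.usage = defaultdict(float)  # team -> accumulated dollars

    def record(self, team, model, prompt_tokens, completion_tokens):
        total = prompt_tokens + completion_tokens
        cost = total / 1000 * self.PRICE_PER_1K_TOKENS[model]
        self.usage[team] += cost
        return cost

tracker = CostTracker()
tracker.record("search-team", "small-model", 900, 100)     # 1000 tokens
tracker.record("search-team", "frontier-model", 500, 500)  # 1000 tokens
print(round(tracker.usage["search-team"], 6))  # 0.0102
```

A real gateway attributes usage from API keys on each proxied request; the accounting itself is this simple.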
Optimization: Maximizing Resource Utilization
Naive LLM deployment can result in 40-50% GPU utilization
Understanding the Transformer architecture:
Tokenization (CPU-bound)
Prefill: processes the whole prompt in parallel (compute-bound on the GPU/accelerator)
Decode: generates one token at a time (memory-bandwidth-bound)
Importance of managing context window size to avoid out-of-memory errors
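The context-window memory pressure mentioned above comes largely from the KV cache, whose size is simple arithmetic. A back-of-envelope sketch, using a Llama-2-7B-like shape as an illustrative assumption:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size, bytes_per_elem=2):
    # 2 tensors (K and V) per layer; fp16 = 2 bytes per element
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem

# Illustrative Llama-2-7B-like shape: 32 layers, 32 KV heads, head_dim 128
gib = kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=8) / 2**30
print(f"{gib:.1f} GiB")  # 16.0 GiB of KV cache alone, before weights
```

At batch size 8 and a 4K context this toy shape already consumes 16 GiB on top of model weights, which is why unmanaged context windows cause out-of-memory errors.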
Leveraging inference engines like vLLM to improve performance:
PagedAttention (virtual-memory-style management of the KV cache)
Continuous batching (dynamic batch size optimization)
Quantization (reduced precision data types)
Benchmarks show 5x better throughput and 80% cost savings per token using vLLM
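Continuous batching, one of the vLLM features listed above, can be illustrated with a toy scheduler: new requests join the running batch as soon as a slot frees, instead of waiting for the whole batch to drain. All numbers here are illustrative, not benchmark data.

```python
from collections import deque

def continuous_batching(requests, max_batch=4):
    """Toy simulation: each request is a number of decode steps to run.
    New requests are admitted every step as slots free up, unlike static
    batching, which waits for the entire batch to finish."""
    pending = deque(requests)
    running = []
    steps = 0
    while pending or running:
        while pending and len(running) < max_batch:  # admit new work each step
            running.append(pending.popleft())
        running = [r - 1 for r in running]           # one decode step for the batch
        running = [r for r in running if r > 0]      # finished requests leave
        steps += 1
    return steps

print(continuous_batching([8, 2, 2, 2, 6], max_batch=4))  # 8
```

Static batching on the same workload would take 8 + 6 = 14 steps (first batch drains fully before the fifth request starts); continuous batching finishes in 8 because short requests free slots early.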
Latency Optimization: Caching and Offloading
Challenge: Latency issues due to repeated processing of system prompts and templates
Techniques to address this:
Prefix caching: Caching frequently used token sequences
KV cache offloading: Offloading the key-value cache to different memory/storage tiers
Semantic caching: Caching based on semantic similarity of requests
AI-aware routing: Routing requests to nodes with pre-computed KV caches
Benchmarks show 3x better performance for time-to-first-token and 2x better overall latency
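The prefix-caching idea above can be sketched in a few lines: reuse the computed state for the longest already-seen token prefix, so only the new suffix needs prefill. This is a toy model of the concept; a real engine (e.g. vLLM's automatic prefix caching) stores GPU KV blocks, not strings.

```python
class PrefixCache:
    """Toy prefix cache mapping token-ID prefixes to a placeholder 'KV state'."""

    def __init__(self):
        self.cache = {}

    def insert(self, tokens):
        # Cache every prefix of the sequence
        for n in range(1, len(tokens) + 1):
            self.cache[tuple(tokens[:n])] = f"kv-state-{n}"

    def longest_prefix(self, tokens):
        """Length of the longest cached prefix of `tokens`."""
        for n in range(len(tokens), 0, -1):
            if tuple(tokens[:n]) in self.cache:
                return n
        return 0

system_prompt = [1, 2, 3, 4, 5]       # shared system-prompt tokens
cache = PrefixCache()
cache.insert(system_prompt)

request = system_prompt + [9, 9]      # new request reusing the system prompt
hit = cache.longest_prefix(request)
print(f"prefill only {len(request) - hit} of {len(request)} tokens")
```

Because chat applications repeat the same system prompt and template on every request, the cached prefix covers most of the prompt, which is where the time-to-first-token gains come from.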
Distributed Inference: Scaling for High-Volume Workloads
When to consider distributed inference:
High traffic volume (on the order of thousands of requests per minute)
Models that cannot fit on a single node
Parallelism strategies:
Data parallelism: Duplicate model, shard data
Tensor parallelism: Shard model weights
Pipeline parallelism: Shard model layers
Expert parallelism: Shard expert modules in mixture-of-experts models
Disaggregated architecture: Separate prefill and decode stages for better resource allocation
Benchmarks show 2x better throughput when scaling from 1 to 2 nodes
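Why a large model "cannot fit on a single node" is back-of-envelope arithmetic: tensor and pipeline parallelism split the weights across tp * pp devices. A rough sketch, assuming fp16 weights and an 80 GiB accelerator (illustrative figures, not benchmark data):

```python
def weights_per_gpu_gib(num_params_b, bytes_per_param=2, tp=1, pp=1):
    """Approximate model-weight memory per GPU under tensor (tp) and
    pipeline (pp) parallelism: weights are split across tp * pp devices.
    Ignores KV cache and activations, which add on top."""
    total_gib = num_params_b * 1e9 * bytes_per_param / 2**30
    return total_gib / (tp * pp)

# A 70B-parameter model in fp16 (~130 GiB) exceeds one 80 GiB GPU...
print(f"{weights_per_gpu_gib(70):.1f} GiB on 1 GPU")
# ...but tensor parallelism across 4 GPUs leaves ~33 GiB each for weights
print(f"{weights_per_gpu_gib(70, tp=4):.1f} GiB per GPU with TP=4")
```

Data parallelism, by contrast, does not reduce per-GPU weight memory at all; it duplicates the model to serve more traffic, which is why the strategies above are chosen by bottleneck (capacity vs. throughput).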
LLM Gateways: Centralized Control and Optimization
Benefits of LLM gateways:
Routing and error handling
Centralized observability and guardrails
Credential management and cost attribution
Intelligent routing capabilities:
Routing large requests to frontier models
Retrying failed requests on alternative servers
Advanced optimization with AIBrix:
Context-aware load balancing
Adaptive model/adapter management
Distributed KV cache management
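Two of the gateway behaviors above, size-based routing to frontier models and retrying on alternative servers, fit in one toy function. Tier names, the token threshold, and backend names are hypothetical illustrations, not any gateway's configuration schema.

```python
def route(prompt_tokens, backends, threshold=2000):
    """Toy gateway routing: large requests go to the frontier tier, and an
    unhealthy backend is skipped in favor of the next one in the list.
    `backends` maps tier -> list of (name, healthy) pairs."""
    tier = "frontier" if prompt_tokens > threshold else "small"
    for name, healthy in backends[tier]:
        if healthy:                      # failover: skip failed servers
            return tier, name
    raise RuntimeError(f"no healthy backend in tier {tier!r}")

backends = {
    "small": [("small-a", False), ("small-b", True)],
    "frontier": [("frontier-a", True)],
}
print(route(500, backends))    # ('small', 'small-b') after failover
print(route(5000, backends))   # ('frontier', 'frontier-a')
```

Production gateways layer health checks, token counting, and per-key policy on top, but the routing decision itself reduces to this shape.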
Open-Source Tools and Resources
AI on EKS: Open-source project providing:
Purpose-built infrastructure for training, inference, and MLOps
Deployable blueprints and charts for various LLM use cases
Practical guidance on performance, cost, and hardware optimization
Workshops and skill-building resources available
Key Takeaways
Establish an AI gateway to gain visibility and control over LLM usage across the organization
Leverage inference engines like vLLM to maximize GPU utilization and reduce costs
Implement caching and offloading strategies to optimize latency for LLM applications
Consider distributed inference architectures to scale for high-volume workloads
Use LLM gateways to centralize control, observability, and advanced optimization capabilities
Leverage open-source tools and resources like AI on EKS to accelerate LLM deployment and operations