AWS re:Invent 2025 - vLLM on AWS: testing to production and everything in between (OPN414)
Deploying Large Language Models (LLMs) on AWS: A Comprehensive Journey
Foundations: Establishing Visibility and Control
Challenges with ad-hoc LLM usage across the organization:
Lack of centralized cost visibility
Difficulty enforcing policies (e.g., content filtering)
Compliance concerns
Solution: Implement an AI gateway to:
Track costs by user, team, or project
Enforce policies (e.g., content filtering for PII)
Audit access and control API keys
Examples of AI gateway frameworks: Envoy AI Gateway, LiteLLM, OpenRouter
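The core gateway features above (cost tracking by team, key-scoped access) can be sketched as a small in-memory cost tracker. This is an illustrative sketch of the idea, not the API of any specific gateway; the class name and price table are assumptions.

```python
from collections import defaultdict

class CostTracker:
    """Illustrative per-team cost attribution, as an AI gateway might do it.
    The price table below is a made-up assumption, not real pricing."""

    PRICE_PER_1K_TOKENS = {"small-model": 0.0002, "frontier-model": 0.01}

    def __init__(self):
        self.usage = defaultdict(float)  # team -> accumulated dollars

    def record(self, team, model, prompt_tokens, completion_tokens):
        total = prompt_tokens + completion_tokens
        cost = total / 1000 * self.PRICE_PER_1K_TOKENS[model]
        self.usage[team] += cost
        return cost

tracker = CostTracker()
tracker.record("search-team", "small-model", 900, 100)     # 1000 tokens
tracker.record("search-team", "frontier-model", 500, 500)  # 1000 tokens
print(round(tracker.usage["search-team"], 6))  # 0.0102
```

A real gateway attributes usage from API keys on each proxied request; the accounting itself is this simple.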
Optimization: Maximizing Resource Utilization
Naive LLM deployment can result in 40-50% GPU utilization
Understanding the Transformer architecture:
Tokenization (CPU-bound)
Prefill: processes the whole prompt in parallel (compute-bound on the GPU/accelerator)
Decode: generates one token at a time (memory-bandwidth-bound)
Importance of managing context window size to avoid out-of-memory errors
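The context-window memory pressure mentioned above comes largely from the KV cache, whose size is simple arithmetic. A back-of-envelope sketch, using a Llama-2-7B-like shape as an illustrative assumption:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size, bytes_per_elem=2):
    # 2 tensors (K and V) per layer; fp16 = 2 bytes per element
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem

# Illustrative Llama-2-7B-like shape: 32 layers, 32 KV heads, head_dim 128
gib = kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=8) / 2**30
print(f"{gib:.1f} GiB")  # 16.0 GiB of KV cache alone, before weights
```

At batch size 8 and a 4K context this toy shape already consumes 16 GiB on top of model weights, which is why unmanaged context windows cause out-of-memory errors.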
Leveraging inference engines like vLLM to improve performance:
PagedAttention (virtual-memory-style management of the KV cache)
Continuous batching (dynamic batch size optimization)
Quantization (reduced precision data types)
Benchmarks show 5x better throughput and 80% cost savings per token using vLLM
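Continuous batching, one of the vLLM features listed above, can be illustrated with a toy scheduler: new requests join the running batch as soon as a slot frees, instead of waiting for the whole batch to drain. All numbers here are illustrative, not benchmark data.

```python
from collections import deque

def continuous_batching(requests, max_batch=4):
    """Toy simulation: each request is a number of decode steps to run.
    New requests are admitted every step as slots free up, unlike static
    batching, which waits for the entire batch to finish."""
    pending = deque(requests)
    running = []
    steps = 0
    while pending or running:
        while pending and len(running) < max_batch:  # admit new work each step
            running.append(pending.popleft())
        running = [r - 1 for r in running]           # one decode step for the batch
        running = [r for r in running if r > 0]      # finished requests leave
        steps += 1
    return steps

print(continuous_batching([8, 2, 2, 2, 6], max_batch=4))  # 8
```

Static batching on the same workload would take 8 + 6 = 14 steps (first batch drains fully before the fifth request starts); continuous batching finishes in 8 because short requests free slots early.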
Latency Optimization: Caching and Offloading
Challenge: Latency issues due to repeated processing of system prompts and templates
Techniques to address this:
Prefix caching: Caching frequently used token sequences
KV cache offloading: Offloading the key-value cache to different memory/storage tiers
Semantic caching: Caching based on semantic similarity of requests
AI-aware routing: Routing requests to nodes with pre-computed KV caches
Benchmarks show 3x better performance for time-to-first-token and 2x better overall latency
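The prefix-caching idea above can be sketched in a few lines: reuse the computed state for the longest already-seen token prefix, so only the new suffix needs prefill. This is a toy model of the concept; a real engine (e.g. vLLM's automatic prefix caching) stores GPU KV blocks, not strings.

```python
class PrefixCache:
    """Toy prefix cache mapping token-ID prefixes to a placeholder 'KV state'."""

    def __init__(self):
        self.cache = {}

    def insert(self, tokens):
        # Cache every prefix of the sequence
        for n in range(1, len(tokens) + 1):
            self.cache[tuple(tokens[:n])] = f"kv-state-{n}"

    def longest_prefix(self, tokens):
        """Length of the longest cached prefix of `tokens`."""
        for n in range(len(tokens), 0, -1):
            if tuple(tokens[:n]) in self.cache:
                return n
        return 0

system_prompt = [1, 2, 3, 4, 5]       # shared system-prompt tokens
cache = PrefixCache()
cache.insert(system_prompt)

request = system_prompt + [9, 9]      # new request reusing the system prompt
hit = cache.longest_prefix(request)
print(f"prefill only {len(request) - hit} of {len(request)} tokens")
```

Because chat applications repeat the same system prompt and template on every request, the cached prefix covers most of the prompt, which is where the time-to-first-token gains come from.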
Distributed Inference: Scaling for High-Volume Workloads
When to consider distributed inference:
High traffic volume (on the order of thousands of requests per minute)
Models that cannot fit on a single node
Parallelism strategies:
Data parallelism: Duplicate model, shard data
Tensor parallelism: Shard model weights
Pipeline parallelism: Shard model layers
Expert parallelism: Shard expert modules in mixture-of-experts models
Disaggregated architecture: Separate prefill and decode stages for better resource allocation
Benchmarks show 2x better throughput when scaling from 1 to 2 nodes
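Why a large model "cannot fit on a single node" is back-of-envelope arithmetic: tensor and pipeline parallelism split the weights across tp * pp devices. A rough sketch, assuming fp16 weights and an 80 GiB accelerator (illustrative figures, not benchmark data):

```python
def weights_per_gpu_gib(num_params_b, bytes_per_param=2, tp=1, pp=1):
    """Approximate model-weight memory per GPU under tensor (tp) and
    pipeline (pp) parallelism: weights are split across tp * pp devices.
    Ignores KV cache and activations, which add on top."""
    total_gib = num_params_b * 1e9 * bytes_per_param / 2**30
    return total_gib / (tp * pp)

# A 70B-parameter model in fp16 (~130 GiB) exceeds one 80 GiB GPU...
print(f"{weights_per_gpu_gib(70):.1f} GiB on 1 GPU")
# ...but tensor parallelism across 4 GPUs leaves ~33 GiB each for weights
print(f"{weights_per_gpu_gib(70, tp=4):.1f} GiB per GPU with TP=4")
```

Data parallelism, by contrast, does not reduce per-GPU weight memory at all; it duplicates the model to serve more traffic, which is why the strategies above are chosen by bottleneck (capacity vs. throughput).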
LLM Gateways: Centralized Control and Optimization
Benefits of LLM gateways:
Routing and error handling
Centralized observability and guardrails
Credential management and cost attribution
Intelligent routing capabilities:
Routing large requests to frontier models
Retrying failed requests on alternative servers
Advanced optimization with AIBrix:
Context-aware load balancing
Adaptive model/adapter management
Distributed KV cache management
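Two of the gateway behaviors above, size-based routing to frontier models and retrying on alternative servers, fit in one toy function. Tier names, the token threshold, and backend names are hypothetical illustrations, not any gateway's configuration schema.

```python
def route(prompt_tokens, backends, threshold=2000):
    """Toy gateway routing: large requests go to the frontier tier, and an
    unhealthy backend is skipped in favor of the next one in the list.
    `backends` maps tier -> list of (name, healthy) pairs."""
    tier = "frontier" if prompt_tokens > threshold else "small"
    for name, healthy in backends[tier]:
        if healthy:                      # failover: skip failed servers
            return tier, name
    raise RuntimeError(f"no healthy backend in tier {tier!r}")

backends = {
    "small": [("small-a", False), ("small-b", True)],
    "frontier": [("frontier-a", True)],
}
print(route(500, backends))    # ('small', 'small-b') after failover
print(route(5000, backends))   # ('frontier', 'frontier-a')
```

Production gateways layer health checks, token counting, and per-key policy on top, but the routing decision itself reduces to this shape.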
Open-Source Tools and Resources
AI on EKS: Open-source project providing:
Purpose-built infrastructure for training, inference, and MLOps
Deployable blueprints and charts for various LLM use cases
Practical guidance on performance, cost, and hardware optimization
Workshops and skill-building resources available
Key Takeaways
Establish an AI gateway to gain visibility and control over LLM usage across the organization
Leverage inference engines like vLLM to maximize GPU utilization and reduce costs
Implement caching and offloading strategies to optimize latency for LLM applications
Consider distributed inference architectures to scale for high-volume workloads
Use LLM gateways to centralize control, observability, and advanced optimization capabilities
Leverage open-source tools and resources like AI on EKS to accelerate LLM deployment and operations