AWS re:Invent 2025 - vLLM on AWS: testing to production and everything in between (OPN414)

Deploying Large Language Models (LLMs) on AWS: A Comprehensive Journey

Foundations: Establishing Visibility and Control

  • Challenges with ad-hoc LLM usage across the organization:
    • Lack of centralized cost visibility
    • Difficulty enforcing policies (e.g., content filtering)
    • Compliance concerns
  • Solution: Implement an AI gateway to:
    • Track costs by user, team, or project
    • Enforce policies (e.g., content filtering for PII)
    • Audit access and control API keys
  • Examples of AI gateway frameworks: Envoy AI Gateway, LiteLLM, OpenRouter
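The cost-attribution idea above can be sketched in a few lines. This is an illustrative toy, not any particular gateway's implementation; the team names and price are made up:

```python
from collections import defaultdict

class UsageLedger:
    """Minimal sketch of per-team cost attribution, as an AI gateway might
    track it. Prices and team names are illustrative."""

    def __init__(self, price_per_1k_tokens):
        self.price = price_per_1k_tokens
        self.tokens = defaultdict(int)

    def record(self, team, prompt_tokens, completion_tokens):
        # A gateway would read these counts from each upstream API response.
        self.tokens[team] += prompt_tokens + completion_tokens

    def cost(self, team):
        return self.tokens[team] / 1000 * self.price

ledger = UsageLedger(price_per_1k_tokens=0.02)
ledger.record("search-team", prompt_tokens=900, completion_tokens=100)
ledger.record("search-team", prompt_tokens=500, completion_tokens=500)
print(round(ledger.cost("search-team"), 4))  # 2000 tokens -> 0.04
```

In practice the gateway sits in front of every model endpoint, so this accounting happens once, centrally, instead of in each application.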

Optimization: Maximizing Resource Utilization

  • Naive LLM deployment can leave GPUs only 40-50% utilized
  • Understanding the Transformer inference pipeline:
    • Tokenization (CPU-bound)
    • Prefill (compute-bound on the GPU/accelerator)
    • Decode (memory-bandwidth-bound)
  • Importance of managing context window size to avoid out-of-memory errors
  • Leveraging inference engines like vLLM to improve performance:
    • PagedAttention (virtual-memory-style management of the KV cache)
    • Continuous batching (dynamic batch size optimization)
    • Quantization (reduced-precision data types)
  • Benchmarks show 5x better throughput and 80% cost savings per token using vLLM
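The continuous-batching idea can be illustrated with a toy scheduler: instead of waiting for an entire batch to drain, finished slots are refilled every decode step. This is a simplification of what engines like vLLM do, with made-up request lengths:

```python
from collections import deque

def continuous_batching(requests, max_batch):
    """Toy scheduler: each step decodes one token for every in-flight
    request, and freed slots are refilled immediately rather than waiting
    for the whole batch to finish. `requests` maps id -> tokens to generate."""
    waiting = deque(requests.items())
    running = {}  # request id -> tokens remaining
    steps = 0
    while waiting or running:
        # Refill free slots from the queue (the "continuous" part).
        while waiting and len(running) < max_batch:
            rid, length = waiting.popleft()
            running[rid] = length
        # One decode step for every running request.
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                del running[rid]  # slot freed mid-batch
        steps += 1
    return steps

print(continuous_batching({"a": 2, "b": 8, "c": 4}, max_batch=2))  # 8 steps
```

With static batching the same workload would take 12 steps (the batch {a, b} runs for 8, then c runs for 4); refilling slots as they free keeps the accelerator busy the whole time.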

Latency Optimization: Caching and Offloading

  • Challenge: Latency issues due to repeated processing of system prompts and templates
  • Techniques to address this:
    • Prefix caching: Caching frequently used token sequences
    • KV cache offloading: Offloading the key-value cache to different memory/storage tiers
    • Semantic caching: Caching based on semantic similarity of requests
    • AI-aware routing: Routing requests to nodes with pre-computed KV caches
  • Benchmarks show 3x faster time-to-first-token and 2x lower overall latency
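Prefix caching, the first technique above, can be sketched with a toy block cache: a shared system prompt is "computed" once, and later requests reuse the longest previously seen prefix. Real engines cache the actual KV tensors per block; here only block hashes are stored, and the block size is arbitrary:

```python
import hashlib

class PrefixCache:
    """Toy prefix cache: token blocks are hashed by (position, content), and
    a request only reuses cache while its prefix matches block-for-block."""

    def __init__(self, block_size=4):
        self.block_size = block_size
        self.blocks = set()

    def prefill(self, tokens):
        """Return (tokens served from cache, tokens that must be computed)."""
        cached = 0
        full_blocks = len(tokens) - len(tokens) % self.block_size
        for start in range(0, full_blocks, self.block_size):
            block = tuple(tokens[start:start + self.block_size])
            key = hashlib.sha256(repr((start, block)).encode()).hexdigest()
            if key in self.blocks and cached == start:
                cached += self.block_size  # hit: skip recomputation
            else:
                self.blocks.add(key)       # miss: compute and remember
        return cached, len(tokens) - cached

system_prompt = list(range(16))  # stand-in for shared system-prompt tokens
cache = PrefixCache()
print(cache.prefill(system_prompt + [101, 102, 103, 104]))  # (0, 20): cold
print(cache.prefill(system_prompt + [201, 202, 203, 204]))  # (16, 4): prefix hit
```

The `cached == start` check enforces the prefix property: a block only counts as a hit if every block before it also hit, which mirrors why identical system prompts and templates benefit most.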

Distributed Inference: Scaling for High-Volume Workloads

  • When to consider distributed inference:
    • High traffic volume (thousands of requests per minute or more)
    • Models that cannot fit on a single node
  • Parallelism strategies:
    • Data parallelism: Duplicate model, shard data
    • Tensor parallelism: Shard model weights
    • Pipeline parallelism: Shard model layers
    • Expert parallelism: Shard expert modules in mixture-of-experts models
  • Disaggregated architecture: Separate prefill and decode stages for better resource allocation
  • Benchmarks show 2x better throughput when scaling from 1 to 2 nodes
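Of the strategies above, tensor parallelism can be shown with a toy matrix-vector product: each "device" holds a slice of the weight matrix, computes a partial result, and an all-reduce sums the partials. Real engines shard across GPUs with collective communication (e.g. NCCL); here shards are just list slices:

```python
def matvec(weight, x):
    """Plain matrix-vector product: weight is rows x cols, x has len cols."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]

def tensor_parallel_matvec(weight, x, shards):
    """Toy column sharding: split the input dimension across 'devices',
    compute partial products, then all-reduce (sum) the partial outputs."""
    cols = len(x)
    step = cols // shards
    partials = []
    for s in range(shards):
        lo = s * step
        hi = (s + 1) * step if s < shards - 1 else cols
        shard_w = [row[lo:hi] for row in weight]  # each device's weight slice
        partials.append(matvec(shard_w, x[lo:hi]))
    # All-reduce: element-wise sum of the per-device partial results.
    return [sum(vals) for vals in zip(*partials)]

W = [[1, 2, 3, 4], [5, 6, 7, 8]]
x = [1, 1, 1, 1]
print(matvec(W, x))                     # [10, 26]
print(tensor_parallel_matvec(W, x, 2))  # same result from two shards
```

Data parallelism, by contrast, would copy `W` to every device and split the requests instead of the weights; the choice depends on whether the model or the traffic is what doesn't fit.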

LLM Gateways: Centralized Control and Optimization

  • Benefits of LLM gateways:
    • Routing and error handling
    • Centralized observability and guardrails
    • Credential management and cost attribution
  • Intelligent routing capabilities:
    • Routing large requests to frontier models
    • Retrying failed requests on alternative servers
  • Advanced optimization with AIBrix (an open-source control plane for vLLM):
    • Context-aware load balancing
    • Adaptive model/adapter management
    • Distributed KV cache management
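The routing and retry behaviors above can be sketched as a small dispatch function. The tiers, thresholds, and server names are illustrative, not from the talk, and the handlers stand in for real HTTP calls:

```python
def route(request, servers, frontier_threshold=2000):
    """Toy gateway routing: large requests go to the frontier tier, everything
    else to the small tier; on failure, retry the next replica in the tier."""
    tier = "frontier" if request["prompt_tokens"] > frontier_threshold else "small"
    last_error = None
    for server in servers[tier]:
        try:
            return server["handler"](request)  # an HTTP call in practice
        except RuntimeError as err:
            last_error = err                   # failover to the next replica
    raise RuntimeError(f"all {tier} servers failed") from last_error

def flaky(request):
    raise RuntimeError("replica down")

def ok(request):
    return {"served_by": "small-2"}

servers = {
    "small": [{"handler": flaky}, {"handler": ok}],
    "frontier": [{"handler": lambda r: {"served_by": "frontier-1"}}],
}
print(route({"prompt_tokens": 300}, servers))   # fails over to small-2
print(route({"prompt_tokens": 5000}, servers))  # routed to frontier-1
```

A production gateway layers observability, guardrails, and credential checks around this same dispatch point, which is what makes it the natural place for centralized control.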

Open-Source Tools and Resources

  • AI on EKS: Open-source project providing:
    • Purpose-built infrastructure for training, inference, and MLOps
    • Deployable blueprints and charts for various LLM use cases
    • Practical guidance on performance, cost, and hardware optimization
  • Workshops and skill-building resources available

Key Takeaways

  • Establish an AI gateway to gain visibility and control over LLM usage across the organization
  • Leverage inference engines like vLLM to maximize GPU utilization and reduce costs
  • Implement caching and offloading strategies to optimize latency for LLM applications
  • Consider distributed inference architectures to scale for high-volume workloads
  • Use LLM gateways to centralize control, observability, and advanced optimization capabilities
  • Leverage open-source tools and resources like AI on EKS to accelerate LLM deployment and operations
