AWS re:Invent 2025 - Scaling foundation model inference on Amazon SageMaker AI (AIM424)

Scaling Foundation Model Inference on Amazon SageMaker AI

Key Trends in 2025

  • Rise of "agentic" workflows where AI agents can take actions to accomplish goals, not just provide responses
  • Increase in "reasoning" models that go through a chain of thought process, generating more tokens per inference
  • Prediction that over a third of applications will have some form of agentic workflows by 2028

Challenges in Deploying Large Language Models (LLMs) at Scale

  1. Performance: Ensuring acceptable response times as concurrency and load increase
  2. Cost: Maximizing GPU utilization to keep costs in check as models become more compute-intensive
  3. Scalability: Efficiently managing resources and scaling up/down based on demand
  4. Flexibility: Complexity in setting up infrastructure, containers, and optimizations for diverse model types

How Amazon SageMaker AI Addresses These Challenges

1. Price Performance

  • Introduced EAGLE speculative decoding to increase throughput by 2.5x without accuracy trade-offs
  • Enabled dynamic loading and offloading of LoRA adapters to maximize GPU memory utilization
  • Launched "Inference Components" to deploy multiple model copies on the same GPU instance
  • Improved autoscaling performance by 50% through container caching on NVMe volumes
  • Provided load-aware and session-aware routing to optimize latency
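Inference components are the mechanism that lets several model copies (or several different models) share the accelerators of one instance behind a single endpoint. As a rough sketch, the snippet below builds the request shape that SageMaker's `CreateInferenceComponent` API expects; the component, endpoint, and model names are placeholders, and the memory/accelerator numbers are illustrative, not recommendations.

```python
import json

def build_inference_component_request(
    component_name: str,
    endpoint_name: str,
    variant_name: str,
    model_name: str,
    accelerators_per_copy: int,
    min_memory_mb: int,
    copy_count: int,
) -> dict:
    """Request body for SageMaker's CreateInferenceComponent API.

    Each inference component reserves a slice of the instance's
    accelerators and memory, so multiple copies can be packed onto
    one GPU instance and scaled independently of the endpoint.
    """
    return {
        "InferenceComponentName": component_name,
        "EndpointName": endpoint_name,
        "VariantName": variant_name,
        "Specification": {
            "ModelName": model_name,
            "ComputeResourceRequirements": {
                "NumberOfAcceleratorDevicesRequired": accelerators_per_copy,
                "MinMemoryRequiredInMb": min_memory_mb,
            },
        },
        "RuntimeConfig": {"CopyCount": copy_count},
    }

# Two copies of a model, each pinned to one GPU on a shared instance
# (all names below are hypothetical placeholders):
request = build_inference_component_request(
    "llama-ic", "shared-endpoint", "AllTraffic", "my-llama-model",
    accelerators_per_copy=1, min_memory_mb=16384, copy_count=2,
)
print(json.dumps(request, indent=2))
# In a real deployment this would be passed to:
#   boto3.client("sagemaker").create_inference_component(**request)
```

Because copy count lives in `RuntimeConfig`, scaling a component up or down adjusts copies without redeploying the endpoint itself.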

2. Flexibility

  • Supports bringing your own containers and inference scripts for customization
  • Benchmarked open-source models like GPT-J to be on par with closed-source alternatives
  • Launched bi-directional streaming for real-time use cases like audio transcription and translation
  • Integrated with Deepgram's speech models for easy deployment on SageMaker
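For the streaming use cases above, boto3's `invoke_endpoint_with_response_stream` returns tokens as they are generated rather than waiting for the full response. The helper below assembles text from the streamed payload parts; the endpoint name and request payload in the commented call are hypothetical, and the exact JSON schema depends on the serving container.

```python
def stream_tokens(event_stream):
    """Yield decoded text chunks from a SageMaker response stream.

    `event_stream` is the iterable found under the "Body" key of the
    response from boto3's invoke_endpoint_with_response_stream; each
    streamed event carries a PayloadPart with raw bytes.
    """
    for event in event_stream:
        part = event.get("PayloadPart")
        if part:
            yield part["Bytes"].decode("utf-8")

# Sketch of the live call (requires AWS credentials and a deployed endpoint):
#   import boto3, json
#   runtime = boto3.client("sagemaker-runtime")
#   response = runtime.invoke_endpoint_with_response_stream(
#       EndpointName="my-streaming-endpoint",   # placeholder name
#       ContentType="application/json",
#       Body=json.dumps({"inputs": "Translate to French: hello"}),
#   )
#   for chunk in stream_tokens(response["Body"]):
#       print(chunk, end="", flush=True)

# Offline demonstration with a faked event stream:
fake_stream = [
    {"PayloadPart": {"Bytes": b"Bon"}},
    {"PayloadPart": {"Bytes": b"jour"}},
]
print("".join(stream_tokens(fake_stream)))  # prints "Bonjour"
```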

3. Ease of Use

  • Introduced self-service GPU capacity reservations to enable experimentation and testing
  • Provided out-of-the-box observability dashboards to monitor model and infrastructure performance
  • Offered managed containers, such as the Large Model Inference (LMI) container, with built-in optimizations for SageMaker

Building Agentic Workflows with SageMaker AI

  • SageMaker endpoints can be integrated with frameworks like LangChain to build agent-based applications
  • Salesforce demonstrated their Agentforce platform, which uses SageMaker for low-latency, multi-channel voice interactions
  • Salesforce leverages SageMaker's inference components, multi-adapter support, and custom model deployment capabilities to optimize their model serving strategies
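To connect a SageMaker endpoint to a framework like LangChain, the wrapper needs a content handler that translates between the framework's prompt strings and the JSON your serving container speaks. The sketch below shows that handler shape without requiring LangChain to be installed; the payload schema assumes a common text-generation container that accepts `{"inputs", "parameters"}` and returns `[{"generated_text": ...}]`, which may differ for your container.

```python
import json

class ContentHandler:
    """Sketch of the content handler LangChain's SagemakerEndpoint
    wrapper expects (with LangChain installed, this would subclass
    LLMContentHandler). It maps prompts to request bytes and
    response bytes back to plain text.
    """
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, prompt: str, model_kwargs: dict) -> bytes:
        # Payload shape is container-specific; this matches containers
        # that accept {"inputs": ..., "parameters": ...}.
        return json.dumps(
            {"inputs": prompt, "parameters": model_kwargs}
        ).encode("utf-8")

    def transform_output(self, output: bytes) -> str:
        body = json.loads(output.decode("utf-8"))
        # Assumes the container returns [{"generated_text": ...}].
        return body[0]["generated_text"]

handler = ContentHandler()
payload = handler.transform_input("What is 2+2?", {"max_new_tokens": 16})
text = handler.transform_output(b'[{"generated_text": "4"}]')
print(text)  # prints "4"

# With LangChain installed, the handler plugs into the endpoint wrapper
# (endpoint name below is a placeholder):
#   from langchain_community.llms import SagemakerEndpoint
#   llm = SagemakerEndpoint(
#       endpoint_name="my-llm-endpoint",
#       region_name="us-east-1",
#       content_handler=handler,  # real subclass of LLMContentHandler
#   )
```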

Key Takeaways

  • SageMaker AI provides a comprehensive platform to efficiently deploy and scale large language models in production
  • New capabilities like speculative decoding, dynamic LoRA adapter management, and inference components enable high throughput and cost-effective inference
  • Flexibility to bring any model or framework, along with ease-of-use features, simplify the journey from model to production
  • Integration with agent-based platforms like Salesforce Agentforce showcases the real-world applications of agentic AI workflows powered by SageMaker
