AWS re:Invent 2025 - Scaling foundation model inference on Amazon SageMaker AI (AIM424)
Scaling Foundation Model Inference on Amazon SageMaker AI
Key Trends in 2025
Rise of "agentic" workflows where AI agents can take actions to accomplish goals, not just provide responses
Increase in "reasoning" models that go through a chain of thought process, generating more tokens per inference
Prediction that over a third of applications will have some form of agentic workflows by 2028
Challenges in Deploying Large Language Models (LLMs) at Scale
Performance: Ensuring consistent response times as concurrency and load increase
Cost: Maximizing GPU utilization to keep costs in check as models become more compute-intensive
Scalability: Efficiently managing resources and scaling up/down based on demand
Flexibility: Complexity in setting up infrastructure, containers, and optimizations for diverse model types
How Amazon SageMaker AI Addresses These Challenges
1. Price Performance
Introduced EAGLE speculative decoding to increase throughput by up to 2.5x without accuracy trade-offs
Enabled dynamic loading and offloading of LoRA adapters to maximize GPU memory utilization
Launched "Inference Components" to deploy multiple model copies on the same GPU instance
Improved autoscaling speed by 50% by caching containers on NVMe volumes
Provided load-aware and session-aware routing to optimize latency
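Inference Components are the mechanism behind several of the points above: each component reserves a slice of an instance's accelerators and memory, and scaling happens in units of model copies rather than whole instances. The sketch below builds the request payload for the real boto3 `create_inference_component` API; the endpoint, model, and component names are hypothetical placeholders, and actually creating the component requires an AWS account and a live endpoint.

```python
# Illustrative sketch: request payload for the boto3 SageMaker API
# create_inference_component, which packs multiple model copies (or
# multiple models) onto the same GPU instance. Names are placeholders.

def build_inference_component_request(
    component_name: str,
    endpoint_name: str,
    model_name: str,
    accelerators_per_copy: int = 1,
    min_memory_mb: int = 1024,
    copy_count: int = 2,
) -> dict:
    """Request shape for sagemaker_client.create_inference_component."""
    return {
        "InferenceComponentName": component_name,
        "EndpointName": endpoint_name,
        "VariantName": "AllTraffic",
        "Specification": {
            "ModelName": model_name,
            # Each copy reserves a slice of the instance's accelerators and
            # memory, so several copies can share one GPU host.
            "ComputeResourceRequirements": {
                "NumberOfAcceleratorDevicesRequired": accelerators_per_copy,
                "MinMemoryRequiredInMb": min_memory_mb,
            },
        },
        # CopyCount is the scaling unit: autoscaling adds or removes
        # copies rather than whole instances.
        "RuntimeConfig": {"CopyCount": copy_count},
    }

request = build_inference_component_request(
    "llama-ic", "my-endpoint", "my-llama-model"
)
# In practice: boto3.client("sagemaker").create_inference_component(**request)
print(request["RuntimeConfig"]["CopyCount"])  # 2
```

Because autoscaling operates on copies, scale-out only needs to load another model copy into already-provisioned (or freshly cached) capacity, which is what makes the container-caching speedup above matter.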
2. Flexibility
Supports bringing your own containers and inference scripts for customization
Benchmarked open-source models such as GPT-J, showing performance on par with closed-source alternatives
Launched bi-directional streaming for real-time use cases like audio transcription and translation
Integrated with Deepgram's speech models for easy deployment on SageMaker
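For streaming use cases, the SageMaker runtime exposes the real boto3 API `invoke_endpoint_with_response_stream`, whose response body is an event stream of `PayloadPart` chunks. The sketch below shows the chunk-parsing logic; `fake_stream` stands in for the AWS response so the pattern can be demonstrated without a live endpoint, and the endpoint name in the comment is a placeholder.

```python
# Hedged sketch: consuming the token stream returned by the boto3 API
# invoke_endpoint_with_response_stream on the "sagemaker-runtime" client.
# fake_stream below substitutes for a real AWS response body.

def iter_stream_text(event_stream):
    """Yield decoded text chunks from a SageMaker response event stream."""
    for event in event_stream:
        part = event.get("PayloadPart")
        if part:  # other event types (e.g. errors) are skipped here
            yield part["Bytes"].decode("utf-8")

# Stand-in for response["Body"] from something like:
#   smr = boto3.client("sagemaker-runtime")
#   response = smr.invoke_endpoint_with_response_stream(
#       EndpointName="my-endpoint", Body=payload,
#       ContentType="application/json")
fake_stream = [
    {"PayloadPart": {"Bytes": b"Hello, "}},
    {"PayloadPart": {"Bytes": b"world!"}},
]
print("".join(iter_stream_text(fake_stream)))  # Hello, world!
```

The same consumption loop applies whether the chunks carry generated text tokens or, as in the transcription use case mentioned above, incremental audio-transcript segments.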
3. Ease of Use
Introduced self-service GPU capacity reservations to enable experimentation and testing
Provided out-of-the-box observability dashboards to monitor model and infrastructure performance
Offered managed containers like the LMI container with optimizations for SageMaker
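The LMI (Large Model Inference) container is configured through a `serving.properties` file. The sketch below renders one; the keys shown (`option.model_id`, `option.tensor_parallel_degree`, `option.rolling_batch`, `option.max_rolling_batch_size`) are real LMI options, but the model ID and values are example placeholders, and the valid option set depends on the container release, so check the LMI documentation for your version.

```python
# Illustrative sketch: rendering a serving.properties file for the
# SageMaker LMI container. Model ID and values are placeholders.

def render_serving_properties(model_id: str, tp_degree: int = 4,
                              max_batch: int = 64) -> str:
    """Render LMI config as the key=value lines serving.properties expects."""
    opts = {
        "option.model_id": model_id,                 # HF Hub ID or S3 path
        "option.tensor_parallel_degree": tp_degree,  # shard across GPUs
        "option.rolling_batch": "vllm",              # continuous batching
        "option.max_rolling_batch_size": max_batch,
    }
    return "\n".join(f"{k}={v}" for k, v in opts.items())

print(render_serving_properties("example-org/example-7b"))
```

Packaging this file alongside the model artifact is what lets the managed container apply optimizations like continuous batching and tensor parallelism without custom inference code.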
Building Agentic Workflows with SageMaker AI
SageMaker endpoints can be integrated with frameworks like Langchain to build agent-based applications
Salesforce demonstrated its Agentforce platform, which uses SageMaker for low-latency, multi-channel voice interactions
Salesforce leverages SageMaker's inference components, multi-adapter support, and custom model deployment capabilities to optimize their model serving strategies
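The agentic pattern described above can be reduced to a simple loop: the hosted model either calls a tool or returns a final answer, and tool observations are fed back into the prompt. The sketch below is a hypothetical illustration, not a SageMaker or LangChain API: `call_model` is a stub standing in for an `invoke_endpoint` call, and the `ACTION:`/`FINAL:` protocol is an invented convention for the example.

```python
# Minimal, hypothetical agent loop: a model hosted behind a SageMaker
# endpoint decides between calling a tool and answering. call_model is a
# stub for invoke_endpoint; the ACTION/FINAL protocol is invented here.
import json

TOOLS = {"get_weather": lambda city: f"Sunny in {city}"}

def call_model(prompt: str) -> str:
    # Stub for a sagemaker-runtime invoke_endpoint call. Returns a canned
    # tool call on the first turn, then a final answer once it sees a
    # tool observation in the prompt.
    if "Observation:" in prompt:
        return "FINAL: It is sunny in Paris."
    return 'ACTION: {"tool": "get_weather", "args": {"city": "Paris"}}'

def run_agent(question: str, max_turns: int = 3) -> str:
    prompt = question
    for _ in range(max_turns):
        reply = call_model(prompt)
        if reply.startswith("FINAL:"):
            return reply[len("FINAL:"):].strip()
        # Parse the tool call, execute it, and append the observation
        # so the next model turn can use it.
        action = json.loads(reply[len("ACTION:"):])
        observation = TOOLS[action["tool"]](**action["args"])
        prompt += f"\nObservation: {observation}"
    return "gave up"

print(run_agent("What's the weather in Paris?"))  # It is sunny in Paris.
```

Frameworks such as LangChain implement this loop (plus prompt formatting and tool schemas) for you; the SageMaker endpoint simply fills the `call_model` role.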
Key Takeaways
SageMaker AI provides a comprehensive platform to efficiently deploy and scale large language models in production
New capabilities such as speculative decoding, dynamic LoRA adapter management, and inference components enable high-throughput, cost-effective inference
Flexibility to bring any model or framework, along with ease-of-use features, simplify the journey from model to production
Integration with agent-based platforms like Salesforce Agentforce showcases real-world applications of agentic AI workflows powered by SageMaker