AWS re:Invent 2025 - Scaling foundation model inference on Amazon SageMaker AI (AIM424)
Scaling Foundation Model Inference on Amazon SageMaker AI
Key Trends in 2025
Rise of "agentic" workflows where AI agents can take actions to accomplish goals, not just provide responses
Increase in "reasoning" models that go through a chain of thought process, generating more tokens per inference
Prediction that over a third of applications will have some form of agentic workflows by 2028
Challenges in Deploying Large Language Models (LLMs) at Scale
Performance: Ensuring consistent response times as concurrency and load increase
Cost: Maximizing GPU utilization to keep costs in check as models become more compute-intensive
Scalability: Efficiently managing resources and scaling up/down based on demand
Flexibility: Complexity in setting up infrastructure, containers, and optimizations for diverse model types
How Amazon SageMaker AI Addresses These Challenges
1. Price Performance
Introduced EAGLE speculative decoding to increase throughput by up to 2.5x without accuracy trade-offs
Enabled dynamic loading and offloading of LoRA adapters to maximize GPU memory utilization
Launched "Inference Components" to deploy multiple model copies on the same GPU instance
Improved autoscaling speed by 50% by caching containers on NVMe volumes
Provided load-aware and session-aware routing to optimize latency
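Inference Components are the mechanism behind several of the points above: each component reserves a slice of an instance's accelerators and memory, and scaling happens in units of model copies rather than whole instances. The sketch below builds the request payload for the real boto3 `create_inference_component` API; the endpoint, model, and component names are hypothetical placeholders, and actually creating the component requires an AWS account and a live endpoint.

```python
# Illustrative sketch: request payload for the boto3 SageMaker API
# create_inference_component, which packs multiple model copies (or
# multiple models) onto the same GPU instance. Names are placeholders.

def build_inference_component_request(
    component_name: str,
    endpoint_name: str,
    model_name: str,
    accelerators_per_copy: int = 1,
    min_memory_mb: int = 1024,
    copy_count: int = 2,
) -> dict:
    """Request shape for sagemaker_client.create_inference_component."""
    return {
        "InferenceComponentName": component_name,
        "EndpointName": endpoint_name,
        "VariantName": "AllTraffic",
        "Specification": {
            "ModelName": model_name,
            # Each copy reserves a slice of the instance's accelerators and
            # memory, so several copies can share one GPU host.
            "ComputeResourceRequirements": {
                "NumberOfAcceleratorDevicesRequired": accelerators_per_copy,
                "MinMemoryRequiredInMb": min_memory_mb,
            },
        },
        # CopyCount is the scaling unit: autoscaling adds or removes
        # copies rather than whole instances.
        "RuntimeConfig": {"CopyCount": copy_count},
    }

request = build_inference_component_request(
    "llama-ic", "my-endpoint", "my-llama-model"
)
# In practice: boto3.client("sagemaker").create_inference_component(**request)
print(request["RuntimeConfig"]["CopyCount"])  # 2
```

Because autoscaling operates on copies, scale-out only needs to load another model copy into already-provisioned (or freshly cached) capacity, which is what makes the container-caching speedup above matter.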
2. Flexibility
Supports bringing your own containers and inference scripts for customization
Benchmarked open-source models such as GPT-J, showing performance on par with closed-source alternatives
Launched bi-directional streaming for real-time use cases like audio transcription and translation
Integrated with Deepgram's speech models for easy deployment on SageMaker
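For streaming use cases, the SageMaker runtime exposes the real boto3 API `invoke_endpoint_with_response_stream`, whose response body is an event stream of `PayloadPart` chunks. The sketch below shows the chunk-parsing logic; `fake_stream` stands in for the AWS response so the pattern can be demonstrated without a live endpoint, and the endpoint name in the comment is a placeholder.

```python
# Hedged sketch: consuming the token stream returned by the boto3 API
# invoke_endpoint_with_response_stream on the "sagemaker-runtime" client.
# fake_stream below substitutes for a real AWS response body.

def iter_stream_text(event_stream):
    """Yield decoded text chunks from a SageMaker response event stream."""
    for event in event_stream:
        part = event.get("PayloadPart")
        if part:  # other event types (e.g. errors) are skipped here
            yield part["Bytes"].decode("utf-8")

# Stand-in for response["Body"] from something like:
#   smr = boto3.client("sagemaker-runtime")
#   response = smr.invoke_endpoint_with_response_stream(
#       EndpointName="my-endpoint", Body=payload,
#       ContentType="application/json")
fake_stream = [
    {"PayloadPart": {"Bytes": b"Hello, "}},
    {"PayloadPart": {"Bytes": b"world!"}},
]
print("".join(iter_stream_text(fake_stream)))  # Hello, world!
```

The same consumption loop applies whether the chunks carry generated text tokens or, as in the transcription use case mentioned above, incremental audio-transcript segments.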
3. Ease of Use
Introduced self-service GPU capacity reservations to enable experimentation and testing
Provided out-of-the-box observability dashboards to monitor model and infrastructure performance
Offered managed containers like the LMI container with optimizations for SageMaker
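The LMI (Large Model Inference) container is configured through a `serving.properties` file. The sketch below renders one; the keys shown (`option.model_id`, `option.tensor_parallel_degree`, `option.rolling_batch`, `option.max_rolling_batch_size`) are real LMI options, but the model ID and values are example placeholders, and the valid option set depends on the container release, so check the LMI documentation for your version.

```python
# Illustrative sketch: rendering a serving.properties file for the
# SageMaker LMI container. Model ID and values are placeholders.

def render_serving_properties(model_id: str, tp_degree: int = 4,
                              max_batch: int = 64) -> str:
    """Render LMI config as the key=value lines serving.properties expects."""
    opts = {
        "option.model_id": model_id,                 # HF Hub ID or S3 path
        "option.tensor_parallel_degree": tp_degree,  # shard across GPUs
        "option.rolling_batch": "vllm",              # continuous batching
        "option.max_rolling_batch_size": max_batch,
    }
    return "\n".join(f"{k}={v}" for k, v in opts.items())

print(render_serving_properties("example-org/example-7b"))
```

Packaging this file alongside the model artifact is what lets the managed container apply optimizations like continuous batching and tensor parallelism without custom inference code.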
Building Agentic Workflows with SageMaker AI
SageMaker endpoints can be integrated with frameworks like Langchain to build agent-based applications
Salesforce demonstrated its Agentforce platform, which uses SageMaker for low-latency, multi-channel voice interactions
Salesforce leverages SageMaker's inference components, multi-adapter support, and custom model deployment capabilities to optimize their model serving strategies
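The agentic pattern described above can be reduced to a simple loop: the hosted model either calls a tool or returns a final answer, and tool observations are fed back into the prompt. The sketch below is a hypothetical illustration, not a SageMaker or LangChain API: `call_model` is a stub standing in for an `invoke_endpoint` call, and the `ACTION:`/`FINAL:` protocol is an invented convention for the example.

```python
# Minimal, hypothetical agent loop: a model hosted behind a SageMaker
# endpoint decides between calling a tool and answering. call_model is a
# stub for invoke_endpoint; the ACTION/FINAL protocol is invented here.
import json

TOOLS = {"get_weather": lambda city: f"Sunny in {city}"}

def call_model(prompt: str) -> str:
    # Stub for a sagemaker-runtime invoke_endpoint call. Returns a canned
    # tool call on the first turn, then a final answer once it sees a
    # tool observation in the prompt.
    if "Observation:" in prompt:
        return "FINAL: It is sunny in Paris."
    return 'ACTION: {"tool": "get_weather", "args": {"city": "Paris"}}'

def run_agent(question: str, max_turns: int = 3) -> str:
    prompt = question
    for _ in range(max_turns):
        reply = call_model(prompt)
        if reply.startswith("FINAL:"):
            return reply[len("FINAL:"):].strip()
        # Parse the tool call, execute it, and append the observation
        # so the next model turn can use it.
        action = json.loads(reply[len("ACTION:"):])
        observation = TOOLS[action["tool"]](**action["args"])
        prompt += f"\nObservation: {observation}"
    return "gave up"

print(run_agent("What's the weather in Paris?"))  # It is sunny in Paris.
```

Frameworks such as LangChain implement this loop (plus prompt formatting and tool schemas) for you; the SageMaker endpoint simply fills the `call_model` role.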
Key Takeaways
SageMaker AI provides a comprehensive platform to efficiently deploy and scale large language models in production
New capabilities such as speculative decoding, dynamic LoRA adapter management, and inference components enable high-throughput, cost-effective inference
Flexibility to bring any model or framework, along with ease-of-use features, simplify the journey from model to production
Integration with agent-based platforms like Salesforce Agentforce showcases real-world applications of agentic AI workflows powered by SageMaker