AWS re:Invent 2025 - How customers build AI at scale with AWS AI infrastructure (AIM252)
Scaling AI in the Enterprise: Lessons from AWS re:Invent 2025
Scaling AI Infrastructure: Challenges and Solutions
Scaling AI workloads presents unique challenges at enterprise scale:
Hardware faults become more common with thousands of GPUs
Networking issues can cause hard-to-debug errors and performance problems
Model implementation details become critical for stability and efficiency
Key strategies for addressing these challenges:
Rigorously test nodes and remove faulty hardware
Minimize network hops and cross-zone/cross-spine traffic
Carefully test and upgrade all software components together
Optimize data loading and checkpointing to maximize GPU utilization
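The last point above, keeping GPUs busy by making checkpointing cheap and crash-safe, can be sketched in a few lines. This is a minimal illustration with hypothetical function names, not AWS's implementation: checkpoints are written atomically (temp file plus rename) so a hardware fault mid-write never corrupts the last good copy, and training resumes from the most recent checkpoint so a failure costs at most `ckpt_every` steps of recomputation.

```python
import os
import pickle

def save_checkpoint(path, step, state):
    """Atomically write a checkpoint: write to a temp file, then
    rename, so a crash mid-write never corrupts the last good copy."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump({"step": step, "state": state}, f)
    os.replace(tmp, path)  # atomic rename on POSIX

def load_checkpoint(path):
    """Return (step, state) from the latest checkpoint, or (0, None)."""
    if not os.path.exists(path):
        return 0, None
    with open(path, "rb") as f:
        ckpt = pickle.load(f)
    return ckpt["step"], ckpt["state"]

def train(path, total_steps, ckpt_every=100):
    """Hypothetical training loop: checkpoint every ckpt_every steps,
    so a node failure loses at most that much work."""
    step, state = load_checkpoint(path)
    state = state if state is not None else {"loss": float("inf")}
    while step < total_steps:
        step += 1
        state["loss"] = 1.0 / step          # stand-in for a real update
        if step % ckpt_every == 0:
            save_checkpoint(path, step, state)
    return step, state
```

In a real cluster the same pattern applies, with sharded checkpoints streamed to object storage instead of a local pickle file.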
Scaling Up vs. Scaling Out
Scaling up: Upgrading to more powerful GPU instances (e.g. P6, H100, Trainium 3)
Allows homogenizing hardware and simplifying management
Can provide significant performance improvements for smaller models
Scaling out: Adding more servers and clusters (e.g. AWS EC2 UltraClusters)
Necessary for training very large models (100B+ parameters)
Requires careful management of distributed workloads and failures
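A back-of-the-envelope memory estimate shows why 100B+ parameter models force scaling out. The sketch below uses a common rule of thumb for mixed-precision Adam training, roughly 16 bytes of state per parameter (bf16 weights and gradients, plus fp32 master weights and two optimizer moments); the figures are illustrative assumptions, not AWS numbers.

```python
def training_memory_gb(params_billion, bytes_per_param=16):
    """Rough GPU-memory estimate for mixed-precision Adam training:
    ~16 bytes/param = 2 (bf16 weights) + 2 (bf16 grads)
    + 4 (fp32 master weights) + 4 + 4 (Adam first/second moments)."""
    return params_billion * 1e9 * bytes_per_param / 1e9  # gigabytes

# A 100B-parameter model needs ~1.6 TB of training state -- far beyond
# a single 80 GB accelerator, before even counting activations.
needed_gb = training_memory_gb(100)
min_gpus = -(-needed_gb // 80)  # ceiling division over 80 GB devices
print(needed_gb, min_gpus)
```

Even this lower bound (20 devices just to hold state) ignores activation memory, which is why such models are sharded across many nodes in practice.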
Optimizing the Inference Stack
Model selection: Choose models aligned with enterprise data and use cases
Supervised fine-tuning, data programming, reinforcement learning
Inference engine optimizations:
Model sharding, disaggregated serving, custom inference kernels
Cross-server serving with fast interconnects (e.g. AWS Elastic Fabric Adapter, EFA)
Speculative decoding to improve throughput
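Speculative decoding, the last optimization above, can be illustrated with a toy greedy version (an instructional sketch, not a production inference engine; the `draft_next`/`target_next` callables are hypothetical stand-ins for models): a cheap draft model proposes `k` tokens, the expensive target model verifies them in what would be one batched pass, and the longest agreeing prefix is kept plus one target-chosen token, so the output is bit-identical to plain greedy decoding with the target model alone.

```python
def speculative_decode(target_next, draft_next, prompt, max_new, k=4):
    """Toy greedy speculative decoding. target_next/draft_next map a
    token sequence to the next token (stand-ins for real models)."""
    seq = list(prompt)
    while len(seq) - len(prompt) < max_new:
        # Draft model proposes k tokens autoregressively (cheap).
        proposal, ctx = [], list(seq)
        for _ in range(k):
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)
        # Target verifies the proposals (one batched pass in practice);
        # keep the longest prefix where both models agree.
        accepted, ctx = 0, list(seq)
        for t in proposal:
            if target_next(ctx) != t:
                break
            ctx.append(t)
            accepted += 1
        seq += proposal[:accepted]
        # On mismatch (or full acceptance) take one token from the
        # target, preserving exact greedy-decoding output.
        seq.append(target_next(seq))
    return seq[len(prompt):][:max_new]
```

The throughput win comes from the verification step being parallel across positions: when the draft agrees often, each expensive target pass yields several tokens instead of one.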
Serving infrastructure:
Prompt caching, session affinity for reliability
Request failover and load shedding for spiky traffic
Global secure serving with private VPCs and network isolation
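Two of the serving safeguards above, load shedding and request failover, compose naturally. The following is a minimal single-threaded sketch with a hypothetical API (not a real AWS service): requests beyond a concurrency cap are rejected fast instead of queueing unboundedly, and a failed call is retried on the next replica.

```python
class ShedAndFailover:
    """Illustrative sketch: cap in-flight work (load shedding) and
    retry failed calls on the next replica (failover)."""

    def __init__(self, replicas, max_in_flight=100):
        self.replicas = replicas          # ordered list of callables
        self.max_in_flight = max_in_flight
        self.in_flight = 0

    def handle(self, request):
        if self.in_flight >= self.max_in_flight:
            return "shed"                 # fast rejection under overload
        self.in_flight += 1
        try:
            last_err = None
            for replica in self.replicas:  # failover: try each in turn
                try:
                    return replica(request)
                except Exception as err:
                    last_err = err
            raise last_err                 # all replicas failed
        finally:
            self.in_flight -= 1
```

Shedding early keeps tail latency bounded for the traffic you do accept, which matters for the spiky request patterns the talk describes.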
Building Agentic AI Systems
Principle 1: Treat models as IP, not commodities
Build models using enterprise data flywheel
Tightly integrate models and applications
Principle 2: Own the full AI stack
Customize hardware, software, and runtime layers
Achieve 10+ trillion tokens per day of inference throughput across the stack

Real-World Results
Fireworks platform:
Serves 150,000 requests per second
Processes over 13 trillion tokens per day
Runs on private, secure cloud infrastructure
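The two headline numbers above are mutually consistent, which is a useful sanity check: a quick calculation (arithmetic only, using the figures stated in the talk) converts tokens per day into tokens per second and an implied average request size.

```python
# Figures stated for the Fireworks platform.
TOKENS_PER_DAY = 13e12
REQUESTS_PER_SECOND = 150_000
SECONDS_PER_DAY = 86_400

tokens_per_second = TOKENS_PER_DAY / SECONDS_PER_DAY
tokens_per_request = tokens_per_second / REQUESTS_PER_SECOND

print(f"{tokens_per_second / 1e6:.1f}M tokens/s")   # ≈ 150.5M tokens/s
print(f"{tokens_per_request:.0f} tokens/request")   # ≈ 1003 tokens/request
```

Roughly a thousand tokens per request is plausible for mixed chat and completion traffic, so the throughput and request-rate claims hang together.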