AWS re:Invent 2025 - How customers build AI at scale with AWS AI infrastructure (AIM252)
Scaling AI in the Enterprise: Lessons from AWS re:Invent 2025
Scaling AI Infrastructure: Challenges and Solutions
Scaling AI workloads presents unique challenges at enterprise scale:
Hardware faults become more common with thousands of GPUs
Networking issues can cause hard-to-debug errors and performance problems
Model implementation details become critical for stability and efficiency
Key strategies for addressing these challenges:
Rigorously test nodes and remove faulty hardware
Minimize network hops and cross-zone/cross-spine traffic
Carefully test and upgrade all software components together
Optimize data loading and checkpointing to maximize GPU utilization
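The last point above, keeping GPUs busy by making checkpointing cheap and crash-safe, can be sketched in a few lines. This is a minimal illustration with hypothetical function names, not AWS's implementation: checkpoints are written atomically (temp file plus rename) so a hardware fault mid-write never corrupts the last good copy, and training resumes from the most recent checkpoint so a failure costs at most `ckpt_every` steps of recomputation.

```python
import os
import pickle

def save_checkpoint(path, step, state):
    """Atomically write a checkpoint: write to a temp file, then
    rename, so a crash mid-write never corrupts the last good copy."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump({"step": step, "state": state}, f)
    os.replace(tmp, path)  # atomic rename on POSIX

def load_checkpoint(path):
    """Return (step, state) from the latest checkpoint, or (0, None)."""
    if not os.path.exists(path):
        return 0, None
    with open(path, "rb") as f:
        ckpt = pickle.load(f)
    return ckpt["step"], ckpt["state"]

def train(path, total_steps, ckpt_every=100):
    """Hypothetical training loop: checkpoint every ckpt_every steps,
    so a node failure loses at most that much work."""
    step, state = load_checkpoint(path)
    state = state if state is not None else {"loss": float("inf")}
    while step < total_steps:
        step += 1
        state["loss"] = 1.0 / step          # stand-in for a real update
        if step % ckpt_every == 0:
            save_checkpoint(path, step, state)
    return step, state
```

In a real cluster the same pattern applies, with sharded checkpoints streamed to object storage instead of a local pickle file.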
Scaling Up vs. Scaling Out
Scaling up: Upgrading to more powerful GPU instances (e.g. P6, H100, Trainium 3)
Allows homogenizing hardware and simplifying management
Can provide significant performance improvements for smaller models
Scaling out: Adding more servers and clusters (e.g. AWS EC2 UltraClusters)
Necessary for training very large models (100B+ parameters)
Requires careful management of distributed workloads and failures
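A back-of-the-envelope memory estimate shows why 100B+ parameter models force scaling out. The sketch below uses a common rule of thumb for mixed-precision Adam training, roughly 16 bytes of state per parameter (bf16 weights and gradients, plus fp32 master weights and two optimizer moments); the figures are illustrative assumptions, not AWS numbers.

```python
def training_memory_gb(params_billion, bytes_per_param=16):
    """Rough GPU-memory estimate for mixed-precision Adam training:
    ~16 bytes/param = 2 (bf16 weights) + 2 (bf16 grads)
    + 4 (fp32 master weights) + 4 + 4 (Adam first/second moments)."""
    return params_billion * 1e9 * bytes_per_param / 1e9  # gigabytes

# A 100B-parameter model needs ~1.6 TB of training state -- far beyond
# a single 80 GB accelerator, before even counting activations.
needed_gb = training_memory_gb(100)
min_gpus = -(-needed_gb // 80)  # ceiling division over 80 GB devices
print(needed_gb, min_gpus)
```

Even this lower bound (20 devices just to hold state) ignores activation memory, which is why such models are sharded across many nodes in practice.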
Optimizing the Inference Stack
Model selection: Choose models aligned with enterprise data and use cases
Supervised fine-tuning, data programming, reinforcement learning
Inference engine optimizations:
Model sharding, disaggregated serving, custom inference kernels
Cross-server serving with fast interconnects (e.g. AWS Elastic Fabric Adapter, EFA)
Speculative decoding to improve throughput
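Speculative decoding, the last optimization above, can be illustrated with a toy greedy version (an instructional sketch, not a production inference engine; the `draft_next`/`target_next` callables are hypothetical stand-ins for models): a cheap draft model proposes `k` tokens, the expensive target model verifies them in what would be one batched pass, and the longest agreeing prefix is kept plus one target-chosen token, so the output is bit-identical to plain greedy decoding with the target model alone.

```python
def speculative_decode(target_next, draft_next, prompt, max_new, k=4):
    """Toy greedy speculative decoding. target_next/draft_next map a
    token sequence to the next token (stand-ins for real models)."""
    seq = list(prompt)
    while len(seq) - len(prompt) < max_new:
        # Draft model proposes k tokens autoregressively (cheap).
        proposal, ctx = [], list(seq)
        for _ in range(k):
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)
        # Target verifies the proposals (one batched pass in practice);
        # keep the longest prefix where both models agree.
        accepted, ctx = 0, list(seq)
        for t in proposal:
            if target_next(ctx) != t:
                break
            ctx.append(t)
            accepted += 1
        seq += proposal[:accepted]
        # On mismatch (or full acceptance) take one token from the
        # target, preserving exact greedy-decoding output.
        seq.append(target_next(seq))
    return seq[len(prompt):][:max_new]
```

The throughput win comes from the verification step being parallel across positions: when the draft agrees often, each expensive target pass yields several tokens instead of one.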
Serving infrastructure:
Prompt caching, session affinity for reliability
Request failover and load shedding for spiky traffic
Global secure serving with private VPCs and network isolation
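Two of the serving safeguards above, load shedding and request failover, compose naturally. The following is a minimal single-threaded sketch with a hypothetical API (not a real AWS service): requests beyond a concurrency cap are rejected fast instead of queueing unboundedly, and a failed call is retried on the next replica.

```python
class ShedAndFailover:
    """Illustrative sketch: cap in-flight work (load shedding) and
    retry failed calls on the next replica (failover)."""

    def __init__(self, replicas, max_in_flight=100):
        self.replicas = replicas          # ordered list of callables
        self.max_in_flight = max_in_flight
        self.in_flight = 0

    def handle(self, request):
        if self.in_flight >= self.max_in_flight:
            return "shed"                 # fast rejection under overload
        self.in_flight += 1
        try:
            last_err = None
            for replica in self.replicas:  # failover: try each in turn
                try:
                    return replica(request)
                except Exception as err:
                    last_err = err
            raise last_err                 # all replicas failed
        finally:
            self.in_flight -= 1
```

Shedding early keeps tail latency bounded for the traffic you do accept, which matters for the spiky request patterns the talk describes.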
Building Agentic AI Systems
Principle 1: Treat models as IP, not commodities
Build models using enterprise data flywheel
Tightly integrate models and applications
Principle 2: Own the full AI stack
Customize hardware, software, and runtime layers
Achieve 10+ trillion tokens per day of inference throughput across the stack

Real-World Results
Fireworks platform:
Serves 150,000 requests per second
Processes over 13 trillion tokens per day
Runs on private, secure cloud infrastructure
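The two headline numbers above are mutually consistent, which is a useful sanity check: a quick calculation (arithmetic only, using the figures stated in the talk) converts tokens per day into tokens per second and an implied average request size.

```python
# Figures stated for the Fireworks platform.
TOKENS_PER_DAY = 13e12
REQUESTS_PER_SECOND = 150_000
SECONDS_PER_DAY = 86_400

tokens_per_second = TOKENS_PER_DAY / SECONDS_PER_DAY
tokens_per_request = tokens_per_second / REQUESTS_PER_SECOND

print(f"{tokens_per_second / 1e6:.1f}M tokens/s")   # ≈ 150.5M tokens/s
print(f"{tokens_per_request:.0f} tokens/request")   # ≈ 1003 tokens/request
```

Roughly a thousand tokens per request is plausible for mixed chat and completion traffic, so the throughput and request-rate claims hang together.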