AWS re:Invent 2025 - AWS Trn3 UltraServers: Power next-generation enterprise AI performance (AIM3335)

Summary of AWS re:Invent 2025 - AWS Trn3 UltraServers: Power next-generation enterprise AI performance

The Tectonic Shift of AI

  • AI is enabling a fundamental transformation in how we build, deploy, and interact with the world
  • AI is driving breakthroughs across scientific domains like protein biology, mathematics, and software engineering
  • AI is becoming the engine powering scientific discovery and innovation

Building the AI Infrastructure of the Future

  • AWS has built the most comprehensive and deeply integrated AI stack to power the next generation of AI
  • Key components include (a minimal launch sketch follows this list):
    • Compute: A broad portfolio of GPU instances plus the latest Inferentia and Trainium instances
    • Network: UltraClusters with the low-latency, low-jitter Elastic Fabric Adapter (EFA)
    • Storage: High-throughput options such as FSx for Lustre and S3 Express One Zone
    • Security: The AWS Nitro System for workload isolation and data protection
    • Management: Observability tools such as CloudWatch
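
As a minimal sketch of how this stack is consumed, the snippet below launches a single EFA-attached Trainium node with boto3. The AMI, subnet, and security-group IDs are placeholders, and the instance type uses today's trn2.48xlarge as a stand-in, since Trn3 instance type names are not given in this summary:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Placeholder IDs throughout -- substitute your own AMI, subnet, and security group.
response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # e.g. a Neuron Deep Learning AMI
    InstanceType="trn2.48xlarge",      # current-gen Trainium type as a Trn3 stand-in
    MinCount=1,
    MaxCount=1,
    NetworkInterfaces=[{
        "DeviceIndex": 0,
        "InterfaceType": "efa",        # attach an Elastic Fabric Adapter
        "SubnetId": "subnet-0123456789abcdef0",
        "Groups": ["sg-0123456789abcdef0"],
    }],
    # Cluster placement groups keep nodes physically close for low-latency collectives.
    Placement={"GroupName": "trainium-cluster-pg"},
)
print(response["Instances"][0]["InstanceId"])
```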

Trainium 3: Powering Next-Gen AI Workloads

  • Trainium 3 is designed to support the evolving needs of AI, including:
    • Longer context lengths for reasoning models
    • Mixture-of-experts models with communication-heavy architectures
    • Infrastructure for pre-training, post-training, and inference
    • High-batch-size, high-throughput systems for concurrent agent-based workloads
  • Key Trainium 3 specifications (rough arithmetic on these figures follows the list):
    • 360 PFLOPS of microscaled FP8 compute (4.4x more than Trainium 2)
    • 20TB of HBM capacity (3.4x more) and 700TB/s of HBM bandwidth (3.9x more)
    • 2x faster interconnect with new Neuron switches for low-latency, high-bandwidth communication
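
Working backwards from the stated multipliers gives a rough sense of the generational jump; this is my arithmetic on the figures above, not numbers from the talk:

```python
# Back-of-envelope: implied Trainium 2 baselines and the compute-to-bandwidth ratio.
trn3_fp8_pflops = 360   # microscaled FP8 compute
trn3_hbm_tb = 20        # HBM capacity
trn3_hbm_tbps = 700     # HBM bandwidth

print(f"Implied Trn2 FP8 compute:   {trn3_fp8_pflops / 4.4:.0f} PFLOPS")  # ~82
print(f"Implied Trn2 HBM capacity:  {trn3_hbm_tb / 3.4:.1f} TB")          # ~5.9
print(f"Implied Trn2 HBM bandwidth: {trn3_hbm_tbps / 3.9:.0f} TB/s")      # ~179

# FLOPs available per HBM byte moved: kernels need high arithmetic
# intensity to stay compute-bound rather than memory-bound.
ratio = (trn3_fp8_pflops * 1e15) / (trn3_hbm_tbps * 1e12)
print(f"Trn3 FP8 FLOPs per HBM byte: {ratio:.0f}")  # ~514
```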

Optimizing Trainium 3 for Performance

  • Focused on achieving maximum sustained performance, not just peak numbers
  • Innovations include:
    • Microscaling for efficient low-precision quantization without accuracy loss (sketched after this list)
    • Accelerated softmax instructions to keep tensor engines fully utilized
    • Extensive co-optimizations across the entire hardware and software stack
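
Microscaling attaches a shared scale to each small block of values (the OCP MX formats use 32-element blocks), so an outlier only distorts its own block rather than the whole tensor. The toy NumPy sketch below illustrates the block-scaling idea; the coarse integer grid is only a stand-in for the low-bit element format, not Trainium's actual datapath:

```python
import numpy as np

def block_quantize(x, block=32, levels=256):
    """Toy block-scaled quantization in the spirit of microscaling (MX):
    one shared scale per `block` values, elements stored at low precision.
    (Real MXFP8 stores FP8 elements with a power-of-two scale per block.)"""
    xb = x.reshape(-1, block)
    # Shared scale per block: fit the block's max magnitude onto the grid.
    amax = np.abs(xb).max(axis=1, keepdims=True) + 1e-30
    scale = amax / (levels / 2 - 1)
    q = np.round(xb / scale).clip(-(levels // 2), levels // 2 - 1) * scale
    return q.reshape(x.shape)

rng = np.random.default_rng(0)
x = rng.standard_normal(1024).astype(np.float32)
x[7] = 500.0  # a large outlier lands in the first 32-element block
q = block_quantize(x)
err = np.abs(x - q)
print("error in the outlier's block:", err[:32].max())  # large scale, coarse steps
print("error everywhere else:      ", err[32:].max())   # unaffected by the outlier
```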

Scaling Trainium 3 for Massive Workloads

  • Trainium 3 is designed for rapid scalability from the ground up
  • Modular, top-accessible, and robotically assembled compute sleds enable fast deployment
  • Leverages AWS's expertise in building massive AI clusters, like the 1-million-chip Project Rainier

Ease of Use for Different User Personas

  • ML Developers: Deep integration with popular ML frameworks and pre-optimized models
  • Researchers: PyTorch-native support with eager execution and automatic optimizations
  • Performance Engineers: Low-level Neuron Kernel Interface (NKI) and Neuron Explorer profiling tools (a rough NKI sketch follows this list)
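
For the performance-engineer path, NKI exposes Trainium's on-chip memories and engines directly from Python. The sketch below follows the element-wise example in the public Neuron documentation; exact signatures vary across Neuron SDK releases, so treat it as indicative rather than exact:

```python
import neuronxcc.nki as nki
import neuronxcc.nki.language as nl

@nki.jit
def add_kernel(a_input, b_input):
    # Allocate the output tensor in device memory (HBM).
    out = nl.ndarray(a_input.shape, dtype=a_input.dtype, buffer=nl.shared_hbm)
    # Load tiles from HBM into on-chip SBUF (inputs must fit NKI tile limits).
    a_tile = nl.load(a_input)
    b_tile = nl.load(b_input)
    # Element-wise add on-chip, then store the result back to HBM.
    nl.store(out, value=a_tile + b_tile)
    return out
```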

The Road Ahead: Trainium 4

  • Targeting a 6x FP4 performance uplift, 4x memory bandwidth, and 2x memory capacity compared to Trainium 3 (implied memory figures worked out below)
  • Continued focus on energy efficiency and end-to-end optimizations
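
Applied to the Trainium 3 figures above, the memory multipliers imply the rough targets below (my arithmetic; no absolute FP4 figure follows from the 6x claim, since a Trainium 3 FP4 baseline is not stated):

```python
# Implied Trainium 4 memory targets from the stated multipliers (back-of-envelope).
trn3_hbm_tbps, trn3_hbm_tb = 700, 20
print(f"Implied Trn4 HBM bandwidth: {4 * trn3_hbm_tbps} TB/s")  # 2800 TB/s (2.8 PB/s)
print(f"Implied Trn4 HBM capacity:  {2 * trn3_hbm_tb} TB")      # 40 TB
```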

Key Takeaways

  • Trainium 3 is designed to power the next generation of large-scale, complex AI workloads
  • AWS has invested heavily in building a comprehensive AI infrastructure stack, from chips to clusters
  • Performance optimizations focus on achieving maximum sustained throughput, not just peak numbers
  • Scalability and ease of use are key priorities, catering to different user personas
  • Rapid iteration continues, with Trainium 4 promising even greater performance and efficiency
