AWS re:Invent 2025 - Break through AI performance and cost barriers with AWS Trainium (AIM201)

Overview

  • This presentation discussed the latest innovations in AWS Trainium, the company's custom-built machine learning chip, and how it addresses the growing demands of generative AI and agentic workflows.
  • The key focus areas were the trends driving AI workloads, the technical innovations in Trainium 3, and how customers are leveraging this technology to build advanced applications.

Industry Trends Driving AI Workloads

  1. Generative AI: Customers are rapidly adopting generative AI models that not only generate responses but also verify them against business requirements, turning model outputs into actionable steps.
  2. Domain-Specific Models: There is a shift towards training application or domain-specific models, which provide much greater accuracy compared to general-purpose models. This is enabled by cheaper model training.
  3. Agentic Workflows: AI systems are evolving beyond simple prompt-response interactions to become agents that can take actions, execute code, and interact with real-world environments to achieve specific tasks.

These trends drive key requirements for the underlying infrastructure:

  • High-performance token generation to handle long reasoning models, agent-based workflows, and increased demand
  • Low latency to support interactive agent-human workflows
  • Cost-effective and easily accessible accelerators to meet the scale and flexibility needs
  • The ability to support a wide range of model sizes, from small domain-specific to large foundation models

Trainium Innovation Journey

  • AWS has a long history of silicon innovation, starting with the Nitro card for network and storage offloading, followed by the Graviton ARM-based CPUs and the Inferentia and Trainium machine learning chips.
  • Trainium 2, announced last year, saw significant advancements in performance and scale, with a 500,000-chip deployment in a single data center.
  • Trainium 3 builds on these learnings, providing 4.4x more compute and 3.9x more memory bandwidth compared to Trainium 2.

Key Innovations in Trainium 3

  1. End-to-End Hardware Integration: Trainium 3 integrates the Nitro, Graviton, and Trainium technologies, enabling better power management, assembly, and supply chain optimization.
  2. High-Performance Data Types: Trainium 3 supports seamless quantization to lower-precision data types (e.g., BF16) without sacrificing accuracy, providing up to 2x performance improvements.
  3. Neuron Switch: This innovation enables direct point-to-point communication between Trainium chips, reducing the latency of collective primitives such as All-Gather and All-Reduce by up to 6x and 2x, respectively.
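To build intuition for what dropping to a lower-precision type like BF16 means numerically, here is a minimal pure-Python sketch of rounding float32 values to bfloat16. This is bit manipulation for illustration only, not the Neuron compiler's actual quantization path:

```python
import struct

def to_bfloat16(x: float) -> float:
    """Round a float32 value to bfloat16 by keeping its top 16 bits.

    bfloat16 shares float32's 8-bit exponent, so dynamic range is
    preserved while the mantissa shrinks from 23 bits to 7 bits.
    """
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    # Round-to-nearest-even on the discarded lower 16 bits.
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack(">f", struct.pack(">I", bits))[0]

# Storage halves; relative error stays bounded by roughly 2^-8.
for w in [0.1, -2.71828, 1e-8, 3.14159]:
    print(f"{w:>12.7g} -> {to_bfloat16(w):>12.7g}")
```

Because the exponent width is unchanged, values that fit in float32 rarely overflow or underflow in BF16, which is one reason models often tolerate the conversion with little accuracy loss.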

Performance and Cost Benefits

  • Trainium 3 delivers a 5x improvement in "tokens per megawatt", significantly reducing the operational costs of running large-scale AI workloads.
  • Customers can choose to either serve the same number of users with 4.5x higher throughput or serve 6x more users with the same throughput, depending on their application requirements.
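The tradeoff above reduces to simple arithmetic. In the sketch below, the 5x, 4.5x, and 6x multipliers are the figures quoted in the talk, while the baseline deployment numbers and the energy price are hypothetical assumptions for illustration:

```python
# Hypothetical baseline deployment; only the multipliers are from the talk.
ENERGY_PRICE_PER_MWH = 80.0            # USD per megawatt-hour, assumed
baseline_tokens_per_sec_per_mw = 2e6   # tokens/s at 1 MW draw, assumed

def cost_per_million_tokens(tokens_per_sec_per_mw: float) -> float:
    """USD of electricity per one million generated tokens."""
    tokens_per_mwh = tokens_per_sec_per_mw * 3600  # one hour at 1 MW
    return ENERGY_PRICE_PER_MWH / tokens_per_mwh * 1e6

base_cost = cost_per_million_tokens(baseline_tokens_per_sec_per_mw)
trn3_cost = cost_per_million_tokens(baseline_tokens_per_sec_per_mw * 5)
print(f"baseline: ${base_cost:.4f}/M tokens, 5x efficiency: ${trn3_cost:.4f}/M tokens")

# The same capacity gain can be spent in two ways:
users, per_user_tps = 1_000, 20        # hypothetical deployment
faster = per_user_tps * 4.5            # same users, 4.5x throughput each
bigger = users * 6                     # same per-user throughput, 6x users
print(f"Option A: {users} users at {faster:.0f} tok/s each")
print(f"Option B: {bigger} users at {per_user_tps} tok/s each")
```

A 5x gain in tokens per megawatt translates directly into a 5x drop in energy cost per token, independent of the assumed baseline.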

Customer Experiences

  1. Poolside: Poolside, which trains foundation models from scratch, has made Trainium central to optimizing its inference workloads, which account for a significant portion of its compute needs. The company has seen up to 2x performance improvements on Trainium 3 compared to state-of-the-art GPUs.
  2. DART AI: DART AI has developed real-time video-to-video diffusion models that generate interactive, frame-by-frame edited video. The team has achieved 80% utilization of the Trainium 3 tensor engines, enabled by architectural features such as centralized SRAM and fine-grained control through NKI (the Neuron Kernel Interface).

Developer Experience and Ecosystem

  • AWS is committed to making Trainium accessible to developers across different skill levels and use cases:
    • Machine Learning Developers: Deep integrations with frameworks like PyTorch, TensorFlow, and Hugging Face.
    • Machine Learning Researchers: PyTorch-native support, including eager execution, distributed training, and torch.compile.
    • Performance Engineers: Direct access to NKI (the Neuron Kernel Interface) and a profiler for low-level optimizations.
  • The entire Neuron software stack, including NKI and its kernel library, will be open-sourced to foster community collaboration and transparency.

Future Outlook

  • AWS announced the upcoming Trainium 4, continuing the company's commitment to iterative hardware innovation to meet the evolving needs of the AI ecosystem.
  • The presentation highlighted the breadth of ML-focused sessions available at AWS re:Invent 2025, encouraging attendees to explore the latest advancements in this space.
