AWS re:Invent 2025 - Break through AI performance and cost barriers with AWS Trainium (AIM201)

Overview

  • This presentation discussed the latest innovations in AWS Trainium, the company's custom-built machine learning chip, and how it addresses the growing demands of generative AI and agentic workflows.
  • The key focus areas were the trends driving AI workloads, the technical innovations in Trainium 3, and how customers are leveraging this technology to build advanced applications.

Industry Trends Driving AI Workloads

  1. Generative AI: Customers are rapidly adopting generative AI models that not only generate responses but also verify them against business requirements, turning model outputs into actionable steps.
  2. Domain-Specific Models: There is a shift towards training application or domain-specific models, which provide much greater accuracy compared to general-purpose models. This is enabled by cheaper model training.
  3. Agentic Workflows: AI systems are evolving beyond simple prompt-response interactions to become agents that can take actions, execute code, and interact with real-world environments to achieve specific tasks.

These trends drive key requirements for the underlying infrastructure:

  • High-performance token generation to handle long reasoning models, agent-based workflows, and increased demand
  • Low latency to support interactive agent-human workflows
  • Cost-effective and easily accessible accelerators to meet the scale and flexibility needs
  • The ability to support a wide range of model sizes, from small domain-specific to large foundation models

Trainium Innovation Journey

  • AWS has a long history of silicon innovation, starting with the Nitro card for network and storage offloading, followed by the Graviton ARM-based CPUs and the Inferentia and Trainium machine learning chips.
  • Trainium 2, announced last year, saw significant advancements in performance and scale, with a 500,000-chip deployment in a single data center.
  • Trainium 3 builds on these learnings, providing 4.4x more compute and 3.9x more memory bandwidth compared to Trainium 2.

Key Innovations in Trainium 3

  1. End-to-End Hardware Integration: Trainium 3 integrates the Nitro, Graviton, and Trainium technologies, enabling better power management, assembly, and supply chain optimization.
  2. High-Performance Data Types: Trainium 3 supports seamless quantization to lower-precision data types (e.g., BF16) without sacrificing accuracy, providing up to 2x performance improvements.
  3. Neuron Switch: This innovation enables direct point-to-point communication between Trainium chips, reducing the latency of collective primitives such as All-Gather and All-Reduce by up to 6x and 2x, respectively.
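To build intuition for what dropping to a lower-precision type like BF16 means numerically, here is a minimal pure-Python sketch of rounding float32 values to bfloat16. This is bit manipulation for illustration only, not the Neuron compiler's actual quantization path:

```python
import struct

def to_bfloat16(x: float) -> float:
    """Round a float32 value to bfloat16 by keeping its top 16 bits.

    bfloat16 shares float32's 8-bit exponent, so dynamic range is
    preserved while the mantissa shrinks from 23 bits to 7 bits.
    """
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    # Round-to-nearest-even on the discarded lower 16 bits.
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack(">f", struct.pack(">I", bits))[0]

# Storage halves; relative error stays bounded by roughly 2^-8.
for w in [0.1, -2.71828, 1e-8, 3.14159]:
    print(f"{w:>12.7g} -> {to_bfloat16(w):>12.7g}")
```

Because the exponent width is unchanged, values that fit in float32 rarely overflow or underflow in BF16, which is one reason models often tolerate the conversion with little accuracy loss.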

Performance and Cost Benefits

  • Trainium 3 delivers a 5x improvement in "tokens per megawatt", significantly reducing the operational costs of running large-scale AI workloads.
  • Customers can choose to either serve the same number of users with 4.5x higher throughput or serve 6x more users with the same throughput, depending on their application requirements.
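The tradeoff above reduces to simple arithmetic. In the sketch below, the 5x, 4.5x, and 6x multipliers are the figures quoted in the talk, while the baseline deployment numbers and the energy price are hypothetical assumptions for illustration:

```python
# Hypothetical baseline deployment; only the multipliers are from the talk.
ENERGY_PRICE_PER_MWH = 80.0            # USD per megawatt-hour, assumed
baseline_tokens_per_sec_per_mw = 2e6   # tokens/s at 1 MW draw, assumed

def cost_per_million_tokens(tokens_per_sec_per_mw: float) -> float:
    """USD of electricity per one million generated tokens."""
    tokens_per_mwh = tokens_per_sec_per_mw * 3600  # one hour at 1 MW
    return ENERGY_PRICE_PER_MWH / tokens_per_mwh * 1e6

base_cost = cost_per_million_tokens(baseline_tokens_per_sec_per_mw)
trn3_cost = cost_per_million_tokens(baseline_tokens_per_sec_per_mw * 5)
print(f"baseline: ${base_cost:.4f}/M tokens, 5x efficiency: ${trn3_cost:.4f}/M tokens")

# The same capacity gain can be spent in two ways:
users, per_user_tps = 1_000, 20        # hypothetical deployment
faster = per_user_tps * 4.5            # same users, 4.5x throughput each
bigger = users * 6                     # same per-user throughput, 6x users
print(f"Option A: {users} users at {faster:.0f} tok/s each")
print(f"Option B: {bigger} users at {per_user_tps} tok/s each")
```

A 5x gain in tokens per megawatt translates directly into a 5x drop in energy cost per token, independent of the assumed baseline.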

Customer Experiences

  1. Poolside: Poolside, which trains foundation models from scratch, has made Trainium central to optimizing its inference workloads, which account for a significant portion of its compute needs. The company has seen up to 2x performance improvements on Trainium 3 compared to state-of-the-art GPUs.
  2. DART AI: DART AI has developed real-time video-to-video diffusion models that generate interactive, frame-by-frame edited video. The team has achieved 80% utilization of the Trainium 3 tensor engines, enabled by architectural features such as centralized SRAM and fine-grained control through NKI (the Neuron Kernel Interface).

Developer Experience and Ecosystem

  • AWS is committed to making Trainium accessible to developers across different skill levels and use cases:
    • Machine Learning Developers: Deep integrations with frameworks like PyTorch, TensorFlow, and Hugging Face.
    • Machine Learning Researchers: PyTorch-native support, including eager execution, distributed training, and torch.compile.
    • Performance Engineers: Direct access to NKI (the Neuron Kernel Interface) and a profiler for low-level optimizations.
  • The entire Neuron software stack, including NKI and its kernel library, will be open-sourced to foster community collaboration and transparency.

Future Outlook

  • AWS announced the upcoming Trainium 4, continuing the company's commitment to iterative hardware innovation to meet the evolving needs of the AI ecosystem.
  • The presentation highlighted the breadth of ML-focused sessions available at AWS re:Invent 2025, encouraging attendees to explore the latest advancements in this space.
