AWS re:Invent 2025 - Performance engineering on Neuron: How to optimize your LLM with NKI (AIM414)

Optimizing Large Language Models with AWS Neuron and NKI

Overview

This session from AWS re:Invent 2025 shows how to use the AWS Neuron SDK and the Neuron Kernel Interface (NKI) to optimize the performance of large language models (LLMs) running on AWS Trainium, AWS's custom ML accelerator chip.

AWS Trainium and Neuron

  • AWS Trainium is the second generation of AWS's custom ML accelerator chips, offering more powerful hardware capabilities compared to the previous Inferentia chips.
  • The core of Trainium and Inferentia is the NeuronCore, which contains specialized compute engines (tensor, vector, scalar, and general-purpose SIMD) and a memory hierarchy of on-chip SRAM backed by high-bandwidth HBM to accelerate ML workloads.
  • The Neuron SDK provides a software stack, including a compiler, runtime, and user tools, to enable customers to leverage the Neuron hardware.

Roofline Model and Optimization Strategies

  • The roofline model illustrates the relationship between a workload's arithmetic intensity (ops/byte) and the hardware's memory bandwidth and compute throughput.
  • The optimization goal is to raise the workload's achieved arithmetic intensity so it moves rightward past the roofline's ridge point, becoming compute-bound rather than memory-bound.
  • Key optimization strategies include:
    • Pipelining operations
    • Minimizing data movement
    • Maximizing data throughput
    • Overlapping data movement with collective communication (in distributed training)
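The roofline model above can be captured in a few lines. This is a minimal sketch with placeholder hardware numbers (not real Trainium specs): attainable throughput is the minimum of the compute ceiling and the memory ceiling, and the "ridge point" is the arithmetic intensity where the two meet.

```python
# Roofline model sketch. The peak-compute and bandwidth figures below are
# hypothetical placeholders, not actual Trainium specifications.
PEAK_TFLOPS = 100.0     # hypothetical peak compute (TFLOP/s)
BANDWIDTH_TB_S = 1.0    # hypothetical HBM bandwidth (TB/s)

def attainable_tflops(ai_flops_per_byte: float) -> float:
    """Roofline: min(compute ceiling, bandwidth * arithmetic intensity)."""
    return min(PEAK_TFLOPS, BANDWIDTH_TB_S * ai_flops_per_byte)

# Below the ridge point a workload is memory-bound; above it, compute-bound.
ridge_ai = PEAK_TFLOPS / BANDWIDTH_TB_S  # 100 FLOPs/byte

print(attainable_tflops(10))    # memory-bound: capped at 10.0 TFLOP/s
print(attainable_tflops(400))   # compute-bound: capped at 100.0 TFLOP/s
```

Optimizations like op fusion and tiling raise a kernel's FLOPs-per-byte ratio, sliding it toward (and ideally past) the ridge point.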

Neuron Kernel Interface (NKI)

  • NKI is a Python-based domain-specific language that lets developers write low-level kernels that run directly on Neuron hardware.
  • NKI provides direct access to the Neuron ISA (instruction set architecture), enabling fine-grained optimization that takes full advantage of the Trainium hardware.
  • NKI integrates with popular ML frameworks like PyTorch and JAX, allowing developers to easily incorporate NKI kernels into their existing models.
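To make this concrete, here is a minimal elementwise-add kernel following the pattern in AWS's public NKI getting-started material. It requires the Neuron compiler (`neuronxcc`) and Neuron hardware to run, so treat it as an illustrative sketch rather than a tested snippet:

```python
import neuronxcc.nki as nki
import neuronxcc.nki.language as nl

@nki.jit
def add_kernel(a_input, b_input):
    # Allocate the kernel's output tensor in device HBM.
    c_output = nl.ndarray(a_input.shape, dtype=a_input.dtype,
                          buffer=nl.shared_hbm)
    # DMA input tiles from HBM into on-chip SBUF memory.
    a_tile = nl.load(a_input)
    b_tile = nl.load(b_input)
    # The elementwise add runs on an on-chip compute engine;
    # nl.store DMAs the result back to HBM.
    nl.store(c_output, a_tile + b_tile)
    return c_output
```

From PyTorch or JAX, a `@nki.jit`-decorated kernel is invoked like an ordinary function on device tensors, which is what makes drop-in replacement of individual ops practical.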

Optimizing Transformer-based Models with NKI

  • The session focuses on optimizing the attention mechanism, a critical component of transformer-based models like Qwen3.
  • The baseline Qwen3 model, compiled to run on Trainium with the Neuron SDK, achieves a throughput of only 0.35 prompts per second and a latency of around 3 seconds.
  • Replacing the default attention implementation with an NKI-based attention kernel yielded a 6-8x performance improvement: 2.99 prompts per second at a latency of around 0.25 seconds.
  • The NKI attention kernel leverages several optimization techniques, including:
    • Efficient memory management and DMA usage
    • Compute engine pipelining and op fusion
    • Optimized memory layouts and tiling
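A key idea behind fused, tiled attention kernels is the online-softmax trick: the full score vector is never materialized, so scores can be consumed tile by tile as they stream through on-chip memory. The pure-Python sketch below illustrates the idea for a single query row; it is a conceptual illustration, not NKI code:

```python
import math

def attention_row(q, keys, values):
    """Single-query softmax attention computed in one streaming pass,
    keeping only a running max, a running normalizer, and a running
    weighted sum -- the recurrence that lets fused attention kernels
    process keys/values tile by tile without storing all scores."""
    m = float("-inf")             # running max of scores (for stability)
    denom = 0.0                   # running softmax normalizer
    acc = [0.0] * len(values[0])  # running weighted sum of values

    for k, v in zip(keys, values):
        score = sum(qi * ki for qi, ki in zip(q, k))
        m_new = max(m, score)
        # Rescale previous partial results to the new max.
        scale = math.exp(m - m_new) if m != float("-inf") else 0.0
        w = math.exp(score - m_new)
        denom = denom * scale + w
        acc = [a * scale + w * vi for a, vi in zip(acc, v)]
        m = m_new

    return [a / denom for a in acc]
```

Because each step touches only one key/value pair (or, in a real kernel, one tile), the intermediate data stays in fast on-chip SRAM, which is exactly the kind of memory-layout and fusion optimization the bullets above describe.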

NKI Library and Future Outlook

  • The NKI library, launched at re:Invent 2025, provides a set of pre-optimized NKI kernels developed and maintained by the AWS team.
  • These kernels cover a range of common model components, including dense layers, and will be expanded to support additional use cases and workloads.
  • By providing access to these low-level hardware optimizations, NKI empowers customers to significantly improve the performance of their ML models running on AWS Trainium.

Key Takeaways

  • AWS Trainium and the Neuron SDK offer powerful hardware and software capabilities to accelerate ML workloads, including LLMs.
  • The Neuron Kernel Interface (NKI) allows developers to write highly optimized kernels that leverage the specialized compute engines and memory hierarchy of Trainium.
  • Integrating NKI kernels into existing models can lead to substantial performance improvements, as demonstrated by the 6-8x throughput increase for the Qwen3 model.
  • The newly launched NKI library provides a growing collection of pre-optimized kernels that customers can use to boost the performance of their ML models on AWS Trainium.
