Key Takeaways on Distributed Training on Amazon SageMaker
The Explosion of Compute and Data for Foundation Models
- The compute needed to train foundation models has grown exponentially, from petaFLOPs in 2010 to thousands of yottaFLOPs today.
- The amount of training data has also been doubling every 8 months, raising concerns about how long internet-scale data can keep pace (see the illustrative calculation after this list).
- These trends have led to unique challenges in managing the infrastructure for large-scale distributed training.
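As a rough sanity check of what an 8-month doubling time implies (an illustrative calculation, not a figure from the presentation), the compounding works out as follows:

```python
# Illustrative arithmetic only: compound growth under an 8-month doubling time.
DOUBLING_PERIOD_MONTHS = 8

def growth_factor(months: float) -> float:
    """Return how much the data volume has multiplied after `months`."""
    return 2 ** (months / DOUBLING_PERIOD_MONTHS)

for years in (1, 2, 4):
    print(f"{years} year(s): ~{growth_factor(12 * years):.1f}x more training data")
# 1 year(s): ~2.8x, 2 year(s): ~8.0x, 4 year(s): ~64.0x
```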
Optimizing the Distributed Training Stack
- The distributed training stack consists of multiple layers, from the accelerator devices to the distributed training frameworks.
- Amazon SageMaker offers two key optimizations:
- SageMaker Model Parallelism, a library built on PyTorch that provides composable training techniques (a generic sketch of this kind of composition follows this list).
- SageMaker HyperPod Recipes, which abstract away the complexity of running large-scale training workloads.
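To give a flavor of what "composable training techniques" look like in code, here is a minimal sketch using open-source PyTorch FSDP; this is a generic illustration of sharded data parallelism plus mixed precision, not the SageMaker Model Parallelism API itself.

```python
# Generic PyTorch FSDP sketch (not the SageMaker Model Parallelism API):
# composes ZeRO-3-style parameter sharding with bf16 mixed precision.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import MixedPrecision, ShardingStrategy

def shard_model(model: torch.nn.Module) -> FSDP:
    # One process per GPU, typically launched with `torchrun`.
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
    return FSDP(
        model.cuda(),
        sharding_strategy=ShardingStrategy.FULL_SHARD,  # shard params, grads, optimizer state
        mixed_precision=MixedPrecision(param_dtype=torch.bfloat16),
    )
```

Model-parallel libraries typically layer tensor and pipeline parallelism on top of this kind of sharding, which is the sense in which the techniques compose.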
SageMaker HyperPod Recipes
- HyperPod Recipes provide optimized configurations for pre-training and fine-tuning large language models on Amazon SageMaker infrastructure.
- With a single line of code, users can switch between GPU and Trainium workloads and access best-in-class training configurations (see the launch sketch after this list).
- The recipes abstract away the complex task of optimizing model configurations, allowing users to focus on their differentiated model training and data.
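Below is a minimal sketch of launching one of these recipes as a SageMaker training job, assuming the SageMaker Python SDK's recipe support on the PyTorch estimator; the role ARN, recipe name, instance settings, and S3 paths are illustrative placeholders (the aws/sagemaker-hyperpod-recipes repository lists the actual recipe names).

```python
# Hedged sketch: launch a HyperPod recipe via the SageMaker Python SDK.
# All identifiers below (role ARN, recipe name, S3 paths) are placeholders.
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    base_job_name="llama3-8b-finetune",
    role="arn:aws:iam::111122223333:role/SageMakerExecutionRole",  # placeholder
    instance_type="ml.p5.48xlarge",   # pick a trn-series type for Trainium instead
    instance_count=4,
    training_recipe="fine-tuning/llama/hf_llama3_8b_seq8k_gpu_lora",  # illustrative name
)
estimator.fit(inputs={"train": "s3://my-bucket/train", "val": "s3://my-bucket/val"})
```

Under this setup, moving between GPU and Trainium amounts to selecting the matching recipe and instance type, which is the spirit of the single-line change referenced above.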
Salesforce's Experience with SageMaker HyperPod
- Salesforce AI Research has been using SageMaker HyperPod to train state-of-the-art models, including:
- XGen Sales LLM, a sales-focused language model
- SFR LlamaRank, a reranking model used in retrieval-augmented generation pipelines
- SFR Judge, a reward model for evaluating model outputs
- Salesforce has been able to scale up their models and take advantage of HyperPod's performance and resilience.
The Future of AI Research and Enterprise Applications
- Salesforce foresees upcoming challenges and opportunities in the AI landscape, such as:
- Scaling up post-training data and compute
- Developing reasoning-focused language models with long-running inference
- Building multi-agent AI systems that integrate specialized components and interactive assistants
Overall, the presentation highlights how Amazon SageMaker HyperPod and its Recipes can simplify and optimize the distributed training of large language models, enabling enterprises like Salesforce to push the boundaries of AI research and development.