Scale FM development with Amazon SageMaker HyperPod (AIM229)

Overview of HyperPod and Generative AI Model Training

  • The panel discussion focuses on how industry leaders built their foundation models and how they plan to keep innovating and delivering for their customers.
  • Shubha Kumbadakone, a senior manager in the Gen AI org at AWS, leads the go-to-market for the SageMaker HyperPod service, a differentiated Gen AI infrastructure offering.
  • The panel includes:
    • Jeff Boudier, Head of Product at Hugging Face
    • Waseem AlShikh, Co-founder and CTO at Writer
    • Robert Bakos, Co-founder and CTO at HOPPR

Key Considerations for Building Foundation Models

  • Data readiness: labeling, data sources, data size, and data modality (text, multimodal)
  • GPU/Compute requirements: GPU types, GPU hours, and scaling efficiency (a back-of-envelope GPU-hour estimate follows this list)
  • Cluster management: job submission, cluster utilization, and infrastructure resiliency
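
To make the GPU-hours bullet concrete, here is a back-of-envelope estimate using the common ~6 × parameters × tokens FLOPs approximation for transformer training. Every number below (model size, token count, GPU peak throughput, utilization) is an illustrative assumption, not a figure from the session:

```python
# Rough GPU-hour estimate for pretraining, using the widely cited
# approximation: training FLOPs ~= 6 * parameter_count * token_count.
# All inputs are illustrative assumptions.
params = 7e9             # hypothetical 7B-parameter model
tokens = 1e12            # hypothetical 1T training tokens
train_flops = 6 * params * tokens

gpu_peak_flops = 312e12  # e.g. A100 BF16 peak throughput
mfu = 0.4                # assumed model FLOPs utilization (hardware-dependent)

gpu_seconds = train_flops / (gpu_peak_flops * mfu)
print(f"~{gpu_seconds / 3600:,.0f} GPU-hours")  # prints roughly ~93,000
```

Scaling efficiency then determines how those GPU-hours translate into wall-clock time across a cluster.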

How HyperPod Addresses These Challenges

  • Enables better training performance by providing a persistent cluster with optimized networking
  • Supports various distributed training frameworks (FSDP, DeepSpeed) and proprietary SageMaker distributed training libraries
  • Provides robust resiliency through health checks, automatic node replacement, and checkpoint/restart capabilities (a minimal checkpoint/restart sketch follows this list)
  • Offers control and visibility through root access, profiling tools, and integration with CloudWatch
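
HyperPod's node replacement can only resume a job if the training script itself checkpoints and restores state. A minimal PyTorch sketch of that pattern is below, assuming a hypothetical checkpoint path on a shared filesystem; the model, loss, and checkpoint cadence are placeholders:

```python
# Minimal checkpoint/restart pattern: save state periodically and, on
# (re)start after a node replacement, resume from the latest checkpoint.
import os
import torch
import torch.nn as nn

CKPT_PATH = "/fsx/checkpoints/latest.pt"  # hypothetical shared-filesystem path

model = nn.Linear(1024, 1024)             # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
start_step = 0

# On startup, resume from the most recent checkpoint if one exists.
if os.path.exists(CKPT_PATH):
    state = torch.load(CKPT_PATH, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    start_step = state["step"] + 1

for step in range(start_step, 10_000):
    loss = model(torch.randn(32, 1024)).pow(2).mean()  # dummy loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if step % 500 == 0:  # checkpoint cadence is a tunable assumption
        torch.save(
            {"model": model.state_dict(),
             "optimizer": optimizer.state_dict(),
             "step": step},
            CKPT_PATH,
        )
```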

Panelists' Experiences with HyperPod

  • Jeff (Hugging Face): Leverages HyperPod's cluster management capabilities to support open science and model development. Uses features such as automated GPU management and CO2 emissions tracking.
  • Waseem (Writer): HyperPod has changed their thinking around cost and resource allocation, letting them focus on model innovation rather than infrastructure management. The resiliency and automation features have been crucial as they scale their models.
  • Robert (HOPPR): HyperPod's ability to handle large, high-resolution medical imaging data and provide flexible GPU instance types has been valuable for their model training and deployment needs.

Distributed Training Techniques and Inference

  • The panelists discuss their use of various distributed training libraries and frameworks, such as PyTorch, DeepSpeed, and FSDP, and the flexibility HyperPod provides to integrate these tools (see the FSDP sketch after this list).
  • For inference, the panelists highlight the benefits of using HyperPod's persistent cluster for both training and inference, as well as the integration with Kubernetes-based solutions for scalable, cost-effective, and secure model deployment.
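
As a concrete illustration of one framework named above, here is a minimal PyTorch FSDP sketch of the kind of training script a HyperPod cluster might run under torchrun (e.g. `torchrun --nproc_per_node=8 train.py`). The model, sizes, and step count are placeholders, and launch details vary by cluster setup:

```python
# Minimal FSDP training sketch. FSDP shards parameters, gradients, and
# optimizer state across ranks, reducing per-GPU memory versus plain DDP.
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    dist.init_process_group("nccl")  # torchrun sets RANK/WORLD_SIZE env vars
    local_rank = dist.get_rank() % torch.cuda.device_count()
    torch.cuda.set_device(local_rank)

    model = nn.Sequential(               # placeholder model
        nn.Linear(4096, 4096), nn.GELU(), nn.Linear(4096, 4096)
    ).cuda()
    model = FSDP(model)                  # wrap the model for sharded training
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for _ in range(10):                  # placeholder training loop
        x = torch.randn(8, 4096, device="cuda")
        loss = model(x).pow(2).mean()    # dummy loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```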

The AWS-Hugging Face Partnership

  • The panelists emphasize the value of the partnership between AWS and Hugging Face, which has enabled easier deployment of Hugging Face models, leveraging AWS services and hardware accelerators like Trainium and Inferentia.
  • The panelists also highlight the technical support and collaborative nature of the partnership, where AWS has helped them overcome challenges and iterate on their solutions.
