AWS re:Invent 2025 - High-Performance Storage for AI/ML, Analytics, and HPC Workloads (STG336)
Overview
Presentation by Aditi, Manish, and Mark from Amazon on high-performance storage solutions for compute-intensive, data-intensive workloads
Covers use cases across machine learning, data analytics, and scientific computing
Discusses the importance of storage performance keeping up with compute power to achieve linear scaling and optimal resource utilization
Challenges with Storage Bottlenecks
Compute-intensive workloads require hundreds/thousands of CPU/GPU cores
These workloads are also data-intensive, requiring fast, reliable access to massive datasets
If storage cannot keep up with compute performance, it becomes the bottleneck, leading to underutilized compute resources and increased costs
Lift-and-Shift File System Customers
Many customers prefer file system interfaces for familiarity, granular permissions, and consistent data access
AWS launched Amazon FSx for Lustre in 2018 to provide a fully managed, elastic, high-performance file system
Key features:
Fully managed: Automatic monitoring, hardware replacement, and software updates
Elastic: Automatically scales storage capacity up and down based on usage
High-performance: Delivers over 1 TB/s of aggregate throughput and sub-millisecond latencies
Optimizing FSx for Lustre Performance
Scalable metadata servers for metadata-intensive workloads
Support for Elastic Fabric Adapter (EFA) and NVIDIA GPU Direct Storage for low-latency, high-throughput access
Client-side caching to further reduce latencies
Real-World Example: Shell
Shell had GPU-based on-premises environment with infrastructure bottlenecks
Burst into the cloud using FSx for Lustre and EC2
Increased GPU utilization from below 90% to 100% by eliminating storage bottlenecks
S3 Data Lake Customers
Many customers have petabytes of data stored in Amazon S3 data lakes
Need to ensure storage performance keeps up with compute to avoid wasting expensive CPU/GPU resources
Amazon S3 Express One Zone
Purpose-built for performance-critical applications
Delivers 10x faster access and 80% lower request costs than S3 Standard
Optimized for both latency-sensitive (small payloads) and throughput-intensive (large payloads) workloads
Latency Optimization Techniques
Co-locating compute instances with S3 Express One Zone directory buckets to reduce network hops
Session-based authentication to avoid latency from IAM authorization
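Session-based authentication amortizes the cost of authorization: a short-lived session credential is fetched once and reused across many requests instead of authorizing each request individually. A minimal sketch of that caching pattern, with a stubbed token fetcher standing in for the actual credential call (all names here are illustrative, not the SDK's API):

```python
import time

class SessionCache:
    """Caches a short-lived session token and refreshes it only when it is
    about to expire, so the authorization round trip is amortized across
    many requests (the pattern behind S3 Express One Zone's session-based
    auth; in practice the AWS SDKs manage this for you)."""

    def __init__(self, fetch_token, ttl_seconds=300, refresh_margin=30):
        self._fetch_token = fetch_token      # callable returning a fresh token
        self._ttl = ttl_seconds
        self._margin = refresh_margin        # refresh slightly before expiry
        self._token = None
        self._expires_at = 0.0

    def get_token(self):
        now = time.monotonic()
        if self._token is None or now >= self._expires_at - self._margin:
            self._token = self._fetch_token()  # one auth round trip
            self._expires_at = now + self._ttl
        return self._token

# Usage with a stub fetcher: it runs once, then the cached token is reused.
calls = []
cache = SessionCache(lambda: calls.append(1) or f"token-{len(calls)}")
first = cache.get_token()
second = cache.get_token()
```

The refresh margin keeps a request from racing the expiry: the token is replaced while the old one is still valid.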
Throughput Optimization Techniques
Parallelization by opening multiple connections and using byte-range GETs and multipart uploads
AWS Common Runtime (CRT) libraries to enable efficient utilization of high-bandwidth network interfaces
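The CRT libraries implement this parallelization internally; the sketch below shows the underlying idea, splitting an object into HTTP Range headers and fetching the parts concurrently. The fetch function is stubbed with an in-memory blob rather than a real S3 GetObject call, so the names and sizes are illustrative only:

```python
from concurrent.futures import ThreadPoolExecutor

def byte_ranges(total_size, part_size):
    """Split an object of total_size bytes into HTTP Range headers of at
    most part_size bytes each, e.g. 'bytes=0-4095'."""
    ranges = []
    start = 0
    while start < total_size:
        end = min(start + part_size, total_size) - 1
        ranges.append(f"bytes={start}-{end}")
        start = end + 1
    return ranges

def parallel_get(fetch_range, total_size, part_size, workers=8):
    """Issue ranged GETs concurrently and reassemble the object in order.
    fetch_range is any callable mapping a Range header to bytes (in real
    use, a wrapper around GetObject; stubbed out here)."""
    ranges = byte_ranges(total_size, part_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(fetch_range, ranges))   # map preserves order
    return b"".join(parts)

# Usage against an in-memory "object" standing in for S3:
blob = bytes(range(256)) * 40  # 10,240-byte object

def fake_fetch(range_header):
    spec = range_header.removeprefix("bytes=")
    start, end = (int(x) for x in spec.split("-"))
    return blob[start:end + 1]

result = parallel_get(fake_fetch, len(blob), part_size=4096)
```

The same range arithmetic applies on the write path with multipart uploads, where each part is uploaded on its own connection.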
Real-World Example: Meta FAIR
Meta FAIR, Meta's AI research organization, used S3 Express One Zone to speed up checkpointing and data loading for LLM training
Achieved 140 Tb/s throughput and over 1 million TPS on a 60 PB dataset
Integrating High-Performance S3 Access
PyTorch connector with built-in high-performance data loading and checkpointing
S3 Analytics Accelerator to optimize metadata access for Parquet datasets
SageMaker Fast File Mode to stream data directly from S3 Express One Zone
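The PyTorch connector exposes S3 prefixes as standard PyTorch datasets. A minimal sketch of the map-style pattern it follows, with a stubbed object-store client in place of the connector's CRT-backed S3 client (class and method names here are hypothetical, not the connector's actual API):

```python
class ObjectStoreDataset:
    """Map-style dataset over keys in an object store: list keys once at
    construction, then fetch each object lazily in __getitem__. This
    mirrors the pattern the S3 connector for PyTorch provides; the
    client here is an in-memory stub."""

    def __init__(self, client, prefix, transform=bytes):
        self._client = client
        self._keys = sorted(client.list_keys(prefix))  # listed up front
        self._transform = transform                    # e.g. decode/deserialize

    def __len__(self):
        return len(self._keys)

    def __getitem__(self, index):
        raw = self._client.get_object(self._keys[index])  # lazy fetch
        return self._transform(raw)

# Usage with an in-memory stub client:
class StubClient:
    def __init__(self, objects):
        self._objects = objects
    def list_keys(self, prefix):
        return [k for k in self._objects if k.startswith(prefix)]
    def get_object(self, key):
        return self._objects[key]

client = StubClient({"train/a.bin": b"\x01",
                     "train/b.bin": b"\x02",
                     "val/c.bin": b"\x03"})
ds = ObjectStoreDataset(client, "train/")
```

Because fetches happen per item, a DataLoader with multiple workers naturally overlaps network reads with training, which is where the connector's high-throughput client pays off.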
Bridging File and Object Storage
FSx for Lustre can be connected to an S3 bucket, providing a file system interface to access S3 data
Bi-directional synchronization between the file system and S3 bucket
Enables customers to leverage the benefits of both file and object storage
Real-World Example: LG AI Research
LG AI Research used SageMaker and FSx for Lustre connected to an S3 bucket to accelerate training of a foundation model
Leveraged file system interface for performance, while benefiting from S3 for long-term storage