AWS re:Invent 2025 - High-Performance Storage for AI/ML, Analytics, and HPC Workloads (STG336)
Overview
Presentation by Aditi, Manish, and Mark from Amazon on high-performance storage solutions for compute-intensive, data-intensive workloads
Covers use cases across machine learning, data analytics, and scientific computing
Discusses the importance of storage performance keeping up with compute power to achieve linear scaling and optimal resource utilization
Challenges with Storage Bottlenecks
Compute-intensive workloads require hundreds/thousands of CPU/GPU cores
These workloads are also data-intensive, requiring fast, reliable access to massive datasets
If storage cannot keep up with compute performance, it becomes the bottleneck, leading to underutilized compute resources and increased costs
Lift-and-Shift File System Customers
Many customers prefer file system interfaces for familiarity, granular permissions, and consistent data access
AWS launched Amazon FSx for Lustre in 2018 to provide a fully managed, elastic, high-performance file system
Key features:
Fully managed: Automatic monitoring, hardware replacement, and software updates
Elastic: Automatically scales storage capacity up and down based on usage
High-performance: Delivers over 1 TB/s of aggregate throughput and sub-millisecond latencies
Optimizing FSx for Lustre Performance
Scalable metadata servers for metadata-intensive workloads
Support for Elastic Fabric Adapter (EFA) and NVIDIA GPU Direct Storage for low-latency, high-throughput access
Client-side caching to further reduce latencies
Real-World Example: Shell
Shell had GPU-based on-premises environment with infrastructure bottlenecks
Burst into the cloud using FSx for Lustre and EC2
Increased GPU utilization from below 90% to 100% by eliminating storage bottlenecks
S3 Data Lake Customers
Many customers have petabytes of data stored in Amazon S3 data lakes
Need to ensure storage performance keeps up with compute to avoid wasting expensive CPU/GPU resources
Amazon S3 Express One Zone
Purpose-built for performance-critical applications
Delivers 10x faster access and 80% lower request costs than S3 Standard
Optimized for both latency-sensitive (small payloads) and throughput-intensive (large payloads) workloads
Latency Optimization Techniques
Co-locating compute instances with S3 Express One Zone directory buckets to reduce network hops
Session-based authentication to avoid latency from IAM authorization
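Session-based authentication amortizes the cost of authorization: a short-lived session credential is fetched once and reused across many requests instead of authorizing each request individually. A minimal sketch of that caching pattern, with a stubbed token fetcher standing in for the actual credential call (all names here are illustrative, not the SDK's API):

```python
import time

class SessionCache:
    """Caches a short-lived session token and refreshes it only when it is
    about to expire, so the authorization round trip is amortized across
    many requests (the pattern behind S3 Express One Zone's session-based
    auth; in practice the AWS SDKs manage this for you)."""

    def __init__(self, fetch_token, ttl_seconds=300, refresh_margin=30):
        self._fetch_token = fetch_token      # callable returning a fresh token
        self._ttl = ttl_seconds
        self._margin = refresh_margin        # refresh slightly before expiry
        self._token = None
        self._expires_at = 0.0

    def get_token(self):
        now = time.monotonic()
        if self._token is None or now >= self._expires_at - self._margin:
            self._token = self._fetch_token()  # one auth round trip
            self._expires_at = now + self._ttl
        return self._token

# Usage with a stub fetcher: it runs once, then the cached token is reused.
calls = []
cache = SessionCache(lambda: calls.append(1) or f"token-{len(calls)}")
first = cache.get_token()
second = cache.get_token()
```

The refresh margin keeps a request from racing the expiry: the token is replaced while the old one is still valid.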
Throughput Optimization Techniques
Parallelization by opening multiple connections and using byte-range GETs and multipart uploads
AWS Common Runtime (CRT) libraries to enable efficient utilization of high-bandwidth network interfaces
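The CRT libraries implement this parallelization internally; the sketch below shows the underlying idea, splitting an object into HTTP Range headers and fetching the parts concurrently. The fetch function is stubbed with an in-memory blob rather than a real S3 GetObject call, so the names and sizes are illustrative only:

```python
from concurrent.futures import ThreadPoolExecutor

def byte_ranges(total_size, part_size):
    """Split an object of total_size bytes into HTTP Range headers of at
    most part_size bytes each, e.g. 'bytes=0-4095'."""
    ranges = []
    start = 0
    while start < total_size:
        end = min(start + part_size, total_size) - 1
        ranges.append(f"bytes={start}-{end}")
        start = end + 1
    return ranges

def parallel_get(fetch_range, total_size, part_size, workers=8):
    """Issue ranged GETs concurrently and reassemble the object in order.
    fetch_range is any callable mapping a Range header to bytes (in real
    use, a wrapper around GetObject; stubbed out here)."""
    ranges = byte_ranges(total_size, part_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(fetch_range, ranges))   # map preserves order
    return b"".join(parts)

# Usage against an in-memory "object" standing in for S3:
blob = bytes(range(256)) * 40  # 10,240-byte object

def fake_fetch(range_header):
    spec = range_header.removeprefix("bytes=")
    start, end = (int(x) for x in spec.split("-"))
    return blob[start:end + 1]

result = parallel_get(fake_fetch, len(blob), part_size=4096)
```

The same range arithmetic applies on the write path with multipart uploads, where each part is uploaded on its own connection.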
Real-World Example: Meta FAIR
Meta FAIR, Meta's AI research organization, used S3 Express One Zone to speed up checkpointing and data loading for LLM training
Achieved 140 Tb/s throughput and over 1 million TPS on a 60 PB dataset
Integrating High-Performance S3 Access
PyTorch connector with built-in high-performance data loading and checkpointing
S3 Analytics Accelerator to optimize metadata access for Parquet datasets
SageMaker Fast File Mode to stream data directly from S3 Express One Zone
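The PyTorch connector exposes S3 prefixes as standard PyTorch datasets. A minimal sketch of the map-style pattern it follows, with a stubbed object-store client in place of the connector's CRT-backed S3 client (class and method names here are hypothetical, not the connector's actual API):

```python
class ObjectStoreDataset:
    """Map-style dataset over keys in an object store: list keys once at
    construction, then fetch each object lazily in __getitem__. This
    mirrors the pattern the S3 connector for PyTorch provides; the
    client here is an in-memory stub."""

    def __init__(self, client, prefix, transform=bytes):
        self._client = client
        self._keys = sorted(client.list_keys(prefix))  # listed up front
        self._transform = transform                    # e.g. decode/deserialize

    def __len__(self):
        return len(self._keys)

    def __getitem__(self, index):
        raw = self._client.get_object(self._keys[index])  # lazy fetch
        return self._transform(raw)

# Usage with an in-memory stub client:
class StubClient:
    def __init__(self, objects):
        self._objects = objects
    def list_keys(self, prefix):
        return [k for k in self._objects if k.startswith(prefix)]
    def get_object(self, key):
        return self._objects[key]

client = StubClient({"train/a.bin": b"\x01",
                     "train/b.bin": b"\x02",
                     "val/c.bin": b"\x03"})
ds = ObjectStoreDataset(client, "train/")
```

Because fetches happen per item, a DataLoader with multiple workers naturally overlaps network reads with training, which is where the connector's high-throughput client pays off.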
Bridging File and Object Storage
FSx for Lustre can be connected to an S3 bucket, providing a file system interface to access S3 data
Bi-directional synchronization between the file system and S3 bucket
Enables customers to leverage the benefits of both file and object storage
Real-World Example: LG AI Research
LG AI Research used SageMaker and FSx for Lustre connected to an S3 bucket to accelerate training of a foundation model
Leveraged file system interface for performance, while benefiting from S3 for long-term storage