AWS re:Invent 2025 - High-performance storage for AI/ML, analytics, and HPC workloads (STG336)

High-Performance Storage for AI/ML, Analytics, and HPC Workloads

Overview

  • Presentation by Aditi, Manish, and Mark from Amazon on high-performance storage for workloads that are both compute-intensive and data-intensive
  • Covers use cases across machine learning, data analytics, and scientific computing
  • Explains why storage performance must keep pace with compute so that workloads scale linearly and resources stay fully utilized

Challenges with Storage Bottlenecks

  • Compute-intensive workloads require hundreds to thousands of CPU or GPU cores
  • These workloads are also data-intensive, requiring fast, reliable access to massive datasets
  • If storage cannot keep up with compute performance, it becomes the bottleneck, leading to underutilized compute resources and increased costs
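
The cost of a storage bottleneck can be estimated with simple arithmetic. The sketch below is illustrative only: the function names, cluster size, per-GPU data rate, and hourly price are all assumed values, not figures from the talk. It treats data delivery as the only limit on utilization.

```python
def utilization_bound(gb_per_s_per_gpu: float, num_gpus: int,
                      storage_gb_per_s: float) -> float:
    """Upper bound on accelerator utilization when storage is the only limit."""
    demand = gb_per_s_per_gpu * num_gpus  # aggregate GB/s the cluster can consume
    if demand <= 0:
        raise ValueError("demand must be positive")
    return min(1.0, storage_gb_per_s / demand)


def idle_cost_per_hour(utilization: float, num_gpus: int,
                       cost_per_gpu_hour: float) -> float:
    """Hourly spend on accelerators that sit idle waiting for data."""
    return (1.0 - utilization) * num_gpus * cost_per_gpu_hour


# Hypothetical cluster: 512 GPUs, each consuming 2 GB/s, fed by 800 GB/s of storage.
util = utilization_bound(gb_per_s_per_gpu=2.0, num_gpus=512, storage_gb_per_s=800.0)
idle_cost = idle_cost_per_hour(util, num_gpus=512, cost_per_gpu_hour=4.0)
print(f"utilization bound: {util:.1%}, idle spend: ${idle_cost:.2f}/hour")
```

Under these assumptions the cluster demands 1,024 GB/s, so storage that delivers 800 GB/s caps utilization well below 100% no matter how the job is tuned.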

Lift-and-Shift File System Customers

  • Many customers prefer file system interfaces for familiarity, granular permissions, and consistent data access
  • AWS launched Amazon FSx for Lustre in 2018 to provide a fully managed, elastic, high-performance file system
  • Key features:
    • Fully managed: Automatic monitoring, hardware replacement, and software updates
    • Elastic: Automatically scales storage capacity up and down based on usage
    • High-performance: Delivers over 1 TB/s of aggregate throughput and sub-millisecond latencies

Optimizing FSx for Lustre Performance

  • Scalable metadata servers for metadata-intensive workloads
  • Support for Elastic Fabric Adapter (EFA) and NVIDIA GPU Direct Storage for low-latency, high-throughput access
  • Client-side caching to further reduce latencies

Real-World Example: Shell

  • Shell ran a GPU-based on-premises environment that suffered from infrastructure bottlenecks
  • Burst into the cloud using FSx for Lustre and EC2
  • Increased GPU utilization from under 90% to 100% by eliminating storage bottlenecks

S3 Data Lake Customers

  • Many customers have petabytes of data stored in Amazon S3 data lakes
  • Need to ensure storage performance keeps up with compute to avoid wasting expensive CPU/GPU resources

Amazon S3 Express One Zone

  • Purpose-built for performance-critical applications
  • Delivers up to 10x faster access and up to 80% lower request costs than S3 Standard
  • Optimized for both latency-sensitive (small payloads) and throughput-intensive (large payloads) workloads

Latency Optimization Techniques

  • Co-locating compute instances with S3 Express One Zone directory buckets to reduce network hops
  • Session-based authentication (the CreateSession API), which authorizes once per session instead of incurring IAM authorization latency on every request
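
The benefit of session-based authentication comes from amortization: one authorization call serves many subsequent requests. The sketch below simulates that idea with a small cache and a fake clock. `Session`, `SessionCache`, the 300-second lifetime, and the 30-second refresh margin are hypothetical choices for illustration; the AWS SDKs manage sessions for directory buckets themselves, so this is not how any SDK is implemented.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Session:
    token: str
    expires_at: float  # seconds, on the same clock the cache uses


class SessionCache:
    """Reuses one session token for many requests until it nears expiry."""

    def __init__(self, create_session: Callable[[], Session],
                 clock: Callable[[], float] = time.monotonic,
                 refresh_margin: float = 30.0) -> None:
        self._create_session = create_session
        self._clock = clock
        self._refresh_margin = refresh_margin
        self._session: Optional[Session] = None
        self.sessions_created = 0  # counts the expensive authorization calls

    def get(self) -> Session:
        now = self._clock()
        expired = (self._session is None
                   or now >= self._session.expires_at - self._refresh_margin)
        if expired:
            self._session = self._create_session()
            self.sessions_created += 1
        return self._session


# Simulation with a fake clock; the 300 s lifetime is an assumed value.
now = [0.0]

def fake_create_session() -> Session:
    return Session(token=f"token-{now[0]:.0f}", expires_at=now[0] + 300.0)

cache = SessionCache(fake_create_session, clock=lambda: now[0])
for _ in range(1000):
    cache.get()              # 1000 requests share a single session
first_count = cache.sessions_created
now[0] = 280.0               # inside the refresh margin, so the next call renews
cache.get()
```

In the simulation, a thousand requests trigger one authorization call, and a second call happens only once the token approaches expiry.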

Throughput Optimization Techniques

  • Parallelization: open multiple connections and use byte-range GETs for reads and multipart uploads for writes
  • AWS Common Runtime (CRT) libraries to enable efficient utilization of high-bandwidth network interfaces
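
The byte-range technique can be sketched as follows: split the object into fixed-size ranges, fetch them concurrently, and reassemble them in order. This is a minimal illustration, not the CRT implementation. `fetch_range` is a hypothetical callable standing in for a ranged GET (with boto3 it might wrap `get_object` with a `Range="bytes=start-end"` argument), and the part size and worker count are arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple


def byte_ranges(object_size: int, part_size: int) -> List[Tuple[int, int]]:
    """Split [0, object_size) into inclusive (start, end) pairs, HTTP Range style."""
    if part_size <= 0:
        raise ValueError("part_size must be positive")
    return [(start, min(start + part_size, object_size) - 1)
            for start in range(0, object_size, part_size)]


def parallel_get(fetch_range: Callable[[int, int], bytes], object_size: int,
                 part_size: int, max_workers: int = 8) -> bytes:
    """Fetch all ranges concurrently and join them in their original order."""
    ranges = byte_ranges(object_size, part_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map preserves input order, so the parts reassemble correctly.
        parts = pool.map(lambda r: fetch_range(*r), ranges)
        return b"".join(parts)


# Stand-in for a ranged GET: slices an in-memory "object".
blob = bytes(range(256)) * 100  # 25,600 bytes

def fake_fetch(start: int, end: int) -> bytes:
    return blob[start:end + 1]

result = parallel_get(fake_fetch, len(blob), part_size=4096)
```

Each range maps to one request on its own connection, which is what lets a single client spread a large object across many connections at once.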

Real-World Example: Meta FAIR

  • Meta FAIR, a research organization, used S3 Express One Zone to speed up checkpointing and data loading for LLM training
  • Achieved 140 Tb/s throughput and over 1 million transactions per second (TPS) on a 60 PB dataset

Integrating High-Performance S3 Access

  • PyTorch connector with built-in high-performance data loading and checkpointing
  • S3 Analytics Accelerator to optimize metadata access for Parquet datasets
  • SageMaker Fast File Mode to stream data directly from S3 Express One Zone

Bridging File and Object Storage

  • FSx for Lustre can be connected to an S3 bucket, providing a file system interface to access S3 data
  • Bi-directional synchronization between the file system and S3 bucket
  • Enables customers to leverage the benefits of both file and object storage
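
Linking a file system to a bucket is configured through a data repository association. The helper below only builds a request dictionary and calls nothing. The field names reflect my understanding of the FSx CreateDataRepositoryAssociation request shape and should be checked against the current API reference. The function name, file system ID, path, and bucket are hypothetical.

```python
from typing import Any, Dict, Optional, Sequence

VALID_EVENTS = {"NEW", "CHANGED", "DELETED"}


def build_dra_request(file_system_id: str, file_system_path: str, s3_uri: str,
                      events: Optional[Sequence[str]] = None) -> Dict[str, Any]:
    """Build parameters that link a file system path to an S3 prefix in both directions."""
    chosen = list(events) if events is not None else ["NEW", "CHANGED", "DELETED"]
    if not set(chosen) <= VALID_EVENTS:
        raise ValueError(f"events must be a subset of {sorted(VALID_EVENTS)}")
    if not file_system_path.startswith("/"):
        raise ValueError("file_system_path must be absolute")
    if not s3_uri.startswith("s3://"):
        raise ValueError("s3_uri must start with s3://")
    return {
        "FileSystemId": file_system_id,
        "FileSystemPath": file_system_path,
        "DataRepositoryPath": s3_uri,
        "S3": {
            # S3 -> file system: surface object changes as file changes
            "AutoImportPolicy": {"Events": chosen},
            # file system -> S3: write file changes back to the bucket
            "AutoExportPolicy": {"Events": chosen},
        },
    }


# Hypothetical identifiers, for illustration only.
params = build_dra_request("fs-0123456789abcdef0", "/training",
                           "s3://example-bucket/datasets/")
```

With boto3 installed and credentials configured, a dictionary like this could presumably be passed as keyword arguments to the FSx client's `create_data_repository_association` call.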

Real-World Example: LG AI Research

  • LG AI Research used SageMaker and FSx for Lustre connected to an S3 bucket to accelerate training of an AI foundation model
  • Leveraged file system interface for performance, while benefiting from S3 for long-term storage
