AWS re:Invent 2025 - HPC at Scale with AWS Parallel Computing Service (PCS) (CMP340)

Overview of HPC at AWS

  • AWS has been working on HPC capabilities since 2015, starting with the open-source CfnCluster (CloudFormation Cluster) toolkit
  • Customers wanted a more fully managed HPC service from AWS, which led to the development of AWS ParallelCluster and eventually AWS Parallel Computing Service (PCS)
  • PCS is designed to address the needs of demanding HPC customers such as Shell and Toyota, who were initially skeptical about running HPC workloads on AWS

What is AWS Parallel Computing Service (PCS)?

  • PCS is a managed Slurm offering that delivers HPC as a fully managed service
  • Slurm was chosen as the initial scheduler because of its popularity in the open-source community and academia, especially for large language model training and AI workloads
  • PCS allows customers to focus on their scientific workloads, research, and simulations, while AWS handles the underlying infrastructure and operations
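
Because PCS exposes a standard Slurm scheduler, existing batch scripts should carry over largely unchanged. The following is a minimal sketch of rendering such a script in Python. The `JobSpec` class, its field names, and the example job are hypothetical and do not come from the talk; the `#SBATCH` directives are standard Slurm options, and the sketch assumes the Slurm partition name matches a PCS queue name.

```python
from dataclasses import dataclass


@dataclass
class JobSpec:
    """Hypothetical description of a batch job (names are illustrative)."""
    name: str
    partition: str        # Slurm partition; assumed to match a PCS queue name
    nodes: int
    ntasks_per_node: int
    time_limit: str       # Slurm time format, for example "02:00:00"
    command: str


def render_sbatch(job: JobSpec) -> str:
    """Render a Slurm batch script using standard #SBATCH directives."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job.name}",
        f"#SBATCH --partition={job.partition}",
        f"#SBATCH --nodes={job.nodes}",
        f"#SBATCH --ntasks-per-node={job.ntasks_per_node}",
        f"#SBATCH --time={job.time_limit}",
        "",
        f"srun {job.command}",
    ]
    return "\n".join(lines) + "\n"


# Example job; the name, sizes, and command are made up for illustration.
example = JobSpec("cfd-run", "compute", 4, 96, "02:00:00", "./solver input.dat")
print(render_sbatch(example))
```

The rendered script would then be submitted from a login node with `sbatch`, the same way as on any other Slurm cluster.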

Key Features and Benefits of PCS

  • Managed Slurm scheduler, allowing dynamic scaling and scheduling of compute resources
  • Infrastructure-as-code and API-driven provisioning, reducing the need for manual cluster management
  • Integrated with AWS services like CloudWatch for observability and cost optimization
  • Flexible architecture supporting CPUs, GPUs, and various storage options
  • Designed to meet the needs of multiple stakeholders: HPC system administrators, scientists, and engineers

Architectural Overview of PCS

  • PCS clusters consist of login nodes, compute node groups, and queues that can be configured to schedule jobs across different instance types
  • The service follows a shared responsibility model, where AWS manages the controller and updates, while customers manage their VPC, compute nodes, and workloads
  • Customers can provision compute capacity using On-Demand Instances, Spot Instances, or a combination of the two, depending on their needs
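
The relationship between queues, compute node groups, and purchase options described above can be sketched as a small data model. This is an illustrative toy, not the PCS API and not how Slurm actually places jobs: the class names, the `pick_node_group` helper, the group names, and the instance types chosen are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ComputeNodeGroup:
    """Hypothetical model of a compute node group."""
    name: str
    instance_type: str
    purchase_option: str  # "ONDEMAND" or "SPOT" in this sketch
    min_nodes: int
    max_nodes: int


@dataclass
class Queue:
    """Hypothetical model of a queue that schedules onto node groups."""
    name: str
    node_groups: List[ComputeNodeGroup] = field(default_factory=list)

    def max_capacity(self) -> int:
        """Total nodes the queue could scale to across all its groups."""
        return sum(group.max_nodes for group in self.node_groups)


def pick_node_group(queue: Queue, nodes_needed: int,
                    allow_spot: bool) -> Optional[ComputeNodeGroup]:
    """Return the first group that could fit the request, or None.

    A deliberately naive first-fit rule, used only to show how one queue
    can span On-Demand and Spot capacity.
    """
    for group in queue.node_groups:
        if group.purchase_option == "SPOT" and not allow_spot:
            continue
        if group.max_nodes >= nodes_needed:
            return group
    return None


# Example: one queue backed by a Spot group and an On-Demand group.
spot = ComputeNodeGroup("cpu-spot", "c7i.48xlarge", "SPOT", 0, 32)
on_demand = ComputeNodeGroup("cpu-od", "hpc7a.96xlarge", "ONDEMAND", 0, 8)
compute_queue = Queue("compute", [spot, on_demand])
print(compute_queue.max_capacity())
```

Listing the Spot group first makes interruptible work prefer cheaper capacity, while jobs that cannot tolerate interruption fall through to the On-Demand group.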

Pricing and Availability

  • PCS pricing includes a fee for the cluster controller and Slurm accounting, in addition to the standard EC2 instance costs
  • PCS is currently available in select AWS Regions, with plans to expand it globally by the end of 2026
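
The pricing structure above (service fees plus standard EC2 costs) can be illustrated with a rough estimator. This is a sketch under stated assumptions: the function name and parameters are hypothetical, both service fees are assumed to be billed per hour, and the rates in the example are placeholders rather than real AWS prices.

```python
def estimate_monthly_cost(controller_rate_per_hour: float,
                          accounting_rate_per_hour: float,
                          ec2_rate_per_hour: float,
                          node_hours: float,
                          hours_in_month: float = 730.0) -> dict:
    """Split an estimated bill into controller, accounting, and EC2 parts.

    Assumes the controller and accounting fees accrue for every hour of
    the month, while EC2 cost accrues only for node-hours actually used.
    """
    controller = controller_rate_per_hour * hours_in_month
    accounting = accounting_rate_per_hour * hours_in_month
    ec2 = ec2_rate_per_hour * node_hours
    return {
        "controller": controller,
        "accounting": accounting,
        "ec2": ec2,
        "total": controller + accounting + ec2,
    }


# Placeholder rates for illustration only; check the AWS pricing pages.
estimate = estimate_monthly_cost(
    controller_rate_per_hour=2.0,
    accounting_rate_per_hour=0.5,
    ec2_rate_per_hour=4.0,
    node_hours=1000.0,
)
print(estimate)
```

Keeping the always-on service fees separate from the usage-driven EC2 term shows where dynamic scaling saves money: only the EC2 term shrinks when nodes scale down.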

Customer Adoption and Success Stories

Toyota Central R&D Labs

  • Toyota faced challenges managing its on-premises HPC environment, including long lead times for adding new resources and inefficient resource utilization
  • By adopting PCS, Toyota was able to:
    • Reduce environment setup time from 6 weeks to 30 minutes
    • Quickly accommodate requests for advanced compute resources such as R7-family 48xlarge instances and p4d.24xlarge instances with NVIDIA A100 GPUs
    • Improve overall utilization and cost optimization through dynamic scaling

Shell

  • Shell initially had concerns about the performance, security, and cost-effectiveness of running HPC workloads on AWS
  • After a long journey, Shell was able to:
    • Achieve a 2.5x acceleration of critical path projects by leveraging PCS and burst capacity
    • Seamlessly integrate PCS with their existing Slurm-based workflows
    • Benefit from the flexibility and scalability of PCS, allowing them to iterate faster on their HPC solutions

Future Developments and Integrations

  • AWS announced the upcoming availability of the latest AMD EPYC Trento processors in the Hpc8a instance family
  • AWS is also investing $50 billion to ensure the latest HPC and AI resources are available in its GovCloud and classified Regions

Key Takeaways

  • PCS provides HPC as a fully managed service, allowing customers to focus on their scientific workloads while AWS handles the underlying infrastructure
  • Customers like Toyota and Shell have seen significant benefits in terms of reduced setup time, improved resource utilization, and accelerated innovation cycles by adopting PCS
  • PCS offers a flexible and scalable architecture, supporting a variety of compute and storage options, and is designed to meet the needs of multiple stakeholders within HPC organizations
  • AWS is continuously investing in and expanding its HPC capabilities, including the upcoming availability of the latest AMD EPYC processors and a $50 billion investment in HPC resources for government and classified workloads
