AWS re:Invent 2025 - HPC at Scale with AWS Parallel Computing Service (PCS) (CMP340)
Overview of HPC at AWS
AWS has been working on HPC capabilities since 2015, starting with the open-source CfnCluster (CloudFormation Cluster) toolkit
Customers wanted a more fully managed HPC service from AWS, which led to the development of AWS ParallelCluster and eventually AWS Parallel Computing Service (PCS)
PCS is designed to address the needs of demanding HPC customers like Shell and Toyota, who were initially skeptical about running HPC workloads on AWS
What is AWS Parallel Computing Service (PCS)?
PCS is a managed Slurm offering that delivers HPC as a fully managed service
Slurm was chosen as the initial scheduler because of its popularity in the open-source community and academia, especially for large language model training and other AI workloads
PCS allows customers to focus on their scientific workloads, research, and simulations, while AWS handles the underlying infrastructure and operations
Key Features and Benefits of PCS
Managed Slurm scheduler, allowing dynamic scaling and scheduling of compute resources
Seamless infrastructure-as-code and API-driven development, reducing the need for manual cluster management
Integrated with AWS services like CloudWatch for observability and cost optimization
Flexible architecture supporting CPUs, GPUs, and various storage options
Designed to meet the needs of multiple stakeholders: HPC system administrators, scientists, and engineers
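The dynamic scaling mentioned in the list above can be illustrated with a small sketch. This is not the algorithm PCS or Slurm uses; it is a hypothetical policy, with made-up names (desired_node_count, nodes_per_job), that only shows the idea of sizing a compute node group to the pending work while staying inside a configured minimum and maximum.

```python
def desired_node_count(pending_jobs, nodes_per_job, min_nodes, max_nodes):
    """Return how many nodes a group should run for the pending work.

    Hypothetical policy (an assumption, not PCS behavior): request
    `nodes_per_job` nodes for every pending job, then clamp the result
    to the group's [min_nodes, max_nodes] range.
    """
    if pending_jobs < 0 or nodes_per_job < 1:
        raise ValueError("pending_jobs must be >= 0 and nodes_per_job >= 1")
    if min_nodes < 0 or max_nodes < min_nodes:
        raise ValueError("require 0 <= min_nodes <= max_nodes")
    wanted = pending_jobs * nodes_per_job
    # Clamp: never drop below the floor, never exceed the ceiling.
    return max(min_nodes, min(max_nodes, wanted))


if __name__ == "__main__":
    # An idle group with a floor of 0 scales down to nothing.
    print(desired_node_count(pending_jobs=0, nodes_per_job=2, min_nodes=0, max_nodes=10))
    # A burst of work is capped at the configured maximum.
    print(desired_node_count(pending_jobs=8, nodes_per_job=2, min_nodes=0, max_nodes=10))
```

With a minimum of zero, an idle group costs nothing in compute, which is the utilization benefit the talk attributes to dynamic scaling.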
Architectural Overview of PCS
PCS clusters consist of login nodes, compute node groups, and queues that can be configured to schedule jobs across different instance types
The service follows a shared responsibility model, where AWS manages the controller and updates, while customers manage their VPC, compute nodes, and workloads
Customers can provision compute capacity for PCS using On-Demand Instances, Spot Instances, or a combination of the two, depending on their needs
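The relationship between compute node groups and queues described above can be sketched as a small data model. The class and field names below (ClusterLayout, ComputeNodeGroup, purchase_option, and so on) are illustrative assumptions and not the PCS API schema; the instance types are only examples. Treating the login nodes as a node group that is not attached to any queue is likewise an assumption of this sketch.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ComputeNodeGroup:
    """A pool of identical-purpose nodes (hypothetical model, not the PCS schema)."""
    name: str
    instance_types: tuple
    purchase_option: str  # "ONDEMAND" or "SPOT" in this sketch
    min_nodes: int = 0
    max_nodes: int = 1


@dataclass
class ClusterLayout:
    """Maps queue names to the node groups that serve them."""
    name: str
    node_groups: dict = field(default_factory=dict)
    queues: dict = field(default_factory=dict)  # queue name -> list of group names

    def add_node_group(self, group):
        if group.purchase_option not in ("ONDEMAND", "SPOT"):
            raise ValueError(f"unknown purchase option: {group.purchase_option}")
        self.node_groups[group.name] = group

    def add_queue(self, queue_name, group_names):
        # A queue may only reference node groups that already exist.
        missing = [g for g in group_names if g not in self.node_groups]
        if missing:
            raise KeyError(f"unknown node groups: {missing}")
        self.queues[queue_name] = list(group_names)

    def instance_types_for(self, queue_name):
        """Instance types a job in this queue could land on, in declaration order."""
        types = []
        for group_name in self.queues[queue_name]:
            for itype in self.node_groups[group_name].instance_types:
                if itype not in types:
                    types.append(itype)
        return types


layout = ClusterLayout(name="demo-cluster")
layout.add_node_group(ComputeNodeGroup("login", ("c7i.xlarge",), "ONDEMAND", 1, 1))
layout.add_node_group(ComputeNodeGroup("cpu-od", ("hpc7a.96xlarge",), "ONDEMAND", 0, 16))
layout.add_node_group(ComputeNodeGroup("cpu-spot", ("c7i.48xlarge",), "SPOT", 0, 64))
layout.add_node_group(ComputeNodeGroup("gpu", ("p4d.24xlarge",), "ONDEMAND", 0, 4))
layout.add_queue("cpu", ["cpu-od", "cpu-spot"])  # mixes On-Demand and Spot capacity
layout.add_queue("gpu", ["gpu"])
```

In this layout the cpu queue draws on both an On-Demand and a Spot node group, which is one way to express the "combination" purchasing choice.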
Pricing and Availability
PCS pricing includes a fee for the cluster controller and Slurm accounting, in addition to the standard EC2 instance costs
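The pricing structure above is additive, which a short calculation makes concrete. The sketch below assumes, for simplicity, that both the controller fee and the Slurm accounting fee are flat hourly rates; the function name and every rate in the example are placeholders, not published AWS prices.

```python
def estimate_cluster_cost(cluster_hours, controller_rate, accounting_rate, instances):
    """Estimate total cost as managed-service fees plus EC2 instance cost.

    cluster_hours:   hours the cluster controller exists
    controller_rate: hourly controller fee (placeholder value supplied by caller)
    accounting_rate: hourly Slurm accounting fee (assumed hourly in this sketch)
    instances:       iterable of (count, hourly_rate, hours_running) tuples
    """
    managed_fee = cluster_hours * (controller_rate + accounting_rate)
    ec2_cost = sum(count * rate * hours for count, rate, hours in instances)
    return {"managed_fee": managed_fee, "ec2": ec2_cost, "total": managed_fee + ec2_cost}


# Placeholder numbers only: a cluster kept for 100 hours, with 4 instances
# at 2.00 per hour that each ran for 10 of those hours.
example = estimate_cluster_cost(
    cluster_hours=100,
    controller_rate=0.5,
    accounting_rate=0.25,
    instances=[(4, 2.0, 10)],
)
print(example)
```

The controller fee accrues for as long as the cluster exists, while the EC2 term only accrues while instances are running, so the split between the two depends heavily on how bursty the workload is.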
PCS is currently available in select AWS regions, but the plan is to expand it globally by the end of 2026
Customer Adoption and Success Stories
Toyota Central R&D Labs
Toyota faced challenges with managing their on-premises HPC environment, including long lead times for adding new resources and inefficient resource utilization
By adopting PCS, Toyota was able to:
Reduce environment setup time from 6 weeks to 30 minutes
Quickly accommodate requests for advanced compute resources such as R7-family 48xlarge instances and p4d.24xlarge instances with NVIDIA A100 GPUs
Improve overall utilization and cost optimization through dynamic scaling
Shell
Shell initially had concerns about the performance, security, and cost-effectiveness of running HPC workloads on AWS
After a long journey, Shell was able to:
Achieve a 2.5x acceleration of critical path projects by leveraging PCS and burst capacity
Seamlessly integrate PCS with their existing Slurm-based workflows
Benefit from the flexibility and scalability of PCS, allowing them to iterate faster on their HPC solutions
Future Developments and Integrations
AWS announced the upcoming availability of the latest AMD EPYC "Turin" processors in the Hpc8a instance family
AWS is also investing $50 billion to make the latest HPC and AI resources available in its GovCloud and classified regions
Key Takeaways
PCS provides a fully-managed HPC-as-a-service solution, allowing customers to focus on their scientific workloads while AWS handles the underlying infrastructure
Customers like Toyota and Shell have seen significant benefits in terms of reduced setup time, improved resource utilization, and accelerated innovation cycles by adopting PCS
PCS offers a flexible and scalable architecture, supporting a variety of compute and storage options, and is designed to meet the needs of multiple stakeholders within HPC organizations
AWS continues to invest in and expand its HPC capabilities, including the upcoming availability of the latest AMD EPYC processors and a $50 billion investment in HPC and AI resources for government and classified workloads