AWS re:Invent 2025 - FINRA: Accelerate Massive Data Processing with NVIDIA on AWS EMR (AIM279)
Introduction to GPU Acceleration for Apache Spark
Apache Spark is a data processing framework widely used across enterprises and organizations
Data volumes have been growing exponentially in recent years, especially for AI and machine learning workloads
To handle this data growth, enterprises are turning to GPU acceleration to speed up Spark workloads
The GPU Acceleration Stack
GPU acceleration is delivered as a plugin that can be easily integrated into existing Spark workflows
This plugin runs on top of the NVIDIA GPU layer, either in the cloud or on-premises
This GPU acceleration can provide significant performance improvements and cost savings
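As a rough illustration of what plugin-based integration can look like, the sketch below assembles the Spark settings that load a GPU plugin. It assumes the plugin is NVIDIA's RAPIDS Accelerator for Apache Spark, which this summary does not name. The helpers `gpu_spark_conf` and `to_submit_args` are hypothetical, and the GPU amounts are placeholders that depend on the cluster.

```python
def gpu_spark_conf(gpus_per_executor: int = 1, tasks_per_gpu: int = 4) -> dict[str, str]:
    """Return Spark settings that load the GPU plugin (assumed: RAPIDS Accelerator)."""
    if gpus_per_executor < 1 or tasks_per_gpu < 1:
        raise ValueError("gpus_per_executor and tasks_per_gpu must be at least 1")
    return {
        # Plugin class of the RAPIDS Accelerator; its jar must be on the classpath.
        "spark.plugins": "com.nvidia.spark.SQLPlugin",
        "spark.rapids.sql.enabled": "true",
        # Standard Spark resource scheduling: GPUs per executor, GPU share per task.
        "spark.executor.resource.gpu.amount": str(gpus_per_executor),
        "spark.task.resource.gpu.amount": str(1 / tasks_per_gpu),
    }


def to_submit_args(conf: dict[str, str]) -> list[str]:
    """Flatten the settings into `--conf key=value` arguments for spark-submit."""
    args: list[str] = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    print(" ".join(to_submit_args(gpu_spark_conf())))
```

Under that assumption, the existing SQL and DataFrame code stays as it is; only the job configuration changes, and operations the plugin does not support fall back to the CPU.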
Real-World Use Cases
Companies across various industries have publicly shared their success with this GPU-accelerated Spark technology
One example is in fraud detection, where billions of records need to be processed using time series analysis - an ideal workload for GPUs
Moving this workload to GPUs resulted in a 14x speedup and 90% cost savings compared to CPU-only processing
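The speedup and the cost savings can be related with simple arithmetic. This is a minimal sketch, assuming job cost is the cluster's hourly price multiplied by runtime; that cost model is an assumption of this example, not something stated in the talk, and `implied_price_ratio` is a hypothetical helper.

```python
def implied_price_ratio(speedup: float, cost_savings: float) -> float:
    """Hourly price of the GPU cluster relative to the CPU cluster.

    Assumes cost = hourly_price * runtime, so
    gpu_cost / cpu_cost = price_ratio / speedup = 1 - cost_savings.
    """
    if speedup <= 0 or not 0 <= cost_savings < 1:
        raise ValueError("speedup must be positive and cost_savings in [0, 1)")
    return (1.0 - cost_savings) * speedup


ratio = implied_price_ratio(speedup=14, cost_savings=0.90)
print(f"{ratio:.2f}")  # 1.40
```

Under this cost model, a 14x speedup with 90% savings corresponds to a GPU cluster priced at about 1.4 times the CPU cluster per hour.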
FINRA's Journey with GPU-Accelerated Spark
Background on FINRA
FINRA is a not-for-profit organization responsible for market integrity and investor protection
They operate over 1 PB of storage in the AWS cloud, processing massive datasets for regulatory compliance and fraud detection
Evaluating GPU Acceleration
FINRA initially used Apache Hive for their SQL queries, then transitioned to Apache Spark
When introduced to GPU-accelerated Spark, they ran tests on the TPC-DS 9B benchmark
This resulted in a 50% performance improvement and 50% cost reduction compared to CPU-only Spark
Applying to Production Workloads
FINRA then applied the GPU-accelerated Spark to their production trading application workloads
Again, they saw performance improvements of around 50% and cost reductions of 45%
However, the initial GPU runs were not optimal, requiring collaboration with NVIDIA to identify and resolve bottlenecks
Integrating GPU Spark into the Data Pipeline
FINRA's data pipeline decompresses 100,000 daily CSV files, converts their column types, and writes them out as Parquet
By transitioning this pipeline to use GPU-accelerated Spark, they achieved consistent runtime and cost savings
This required some code changes to leverage Spark DataFrames instead of the less GPU-friendly Dataset API
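A pipeline of this shape can be written with DataFrame operations only. The sketch below is not FINRA's code: the column names, types, and paths are made up, and it assumes gzip-compressed CSV inputs with a header row, which Spark's CSV reader decompresses based on the file extension. The `spark` argument is expected to be an existing `SparkSession`.

```python
# Hypothetical column -> Spark SQL type mapping; real feeds will differ.
SCHEMA = {"trade_id": "bigint", "symbol": "string", "price": "double"}


def cast_exprs(schema: dict[str, str]) -> list[str]:
    """Build SQL expressions that cast each string column to its target type."""
    return [f"CAST(`{name}` AS {dtype}) AS `{name}`" for name, dtype in schema.items()]


def csv_to_parquet(spark, src: str, dst: str, schema: dict[str, str] = SCHEMA) -> None:
    """Read CSV (for example *.csv.gz), cast the columns, and write Parquet."""
    # Without a schema or inferSchema, every CSV column is read as a string.
    raw = spark.read.option("header", "true").csv(src)
    # Type conversion as column expressions, with no per-row lambdas.
    typed = raw.selectExpr(*cast_exprs(schema))
    typed.write.mode("overwrite").parquet(dst)


if __name__ == "__main__":
    for expr in cast_exprs(SCHEMA):
        print(expr)
```

Column expressions like these stay inside Spark's SQL engine, where a GPU plugin can generally substitute its own operators. The typed Dataset API is a Scala and Java feature, so this Python sketch only shows the DataFrame side of that change.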
Lessons Learned and the Path Forward
Not every workload will see immediate benefits from GPU acceleration - identifying bottlenecks is crucial
FINRA has established a process to validate CPU vs GPU performance for their workloads
While CPU remains the default, GPU acceleration is now a strategic part of FINRA's big data technology stack for the future
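One way to picture a CPU-versus-GPU validation step is a small gate that keeps CPU as the default unless a measured GPU run is clearly better. This is a hedged sketch, not FINRA's actual process: the `Run` type, the `choose_engine` function, and the thresholds are hypothetical, and cost is assumed to be runtime multiplied by the cluster's hourly price.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """One measured execution of the same workload on a given cluster."""
    runtime_hours: float
    cluster_price_per_hour: float

    @property
    def cost(self) -> float:
        # Assumed cost model: hourly cluster price times runtime.
        return self.runtime_hours * self.cluster_price_per_hour


def choose_engine(cpu: Run, gpu: Run,
                  min_speedup: float = 1.2, min_savings: float = 0.1) -> str:
    """Return "gpu" only if the GPU run beats both thresholds, else "cpu"."""
    speedup = cpu.runtime_hours / gpu.runtime_hours
    savings = 1.0 - gpu.cost / cpu.cost
    return "gpu" if speedup >= min_speedup and savings >= min_savings else "cpu"


if __name__ == "__main__":
    # Made-up numbers for illustration only.
    print(choose_engine(Run(2.0, 10.0), Run(1.0, 11.0)))
```

Requiring both a speedup and a cost saving reflects the point that a faster run on a more expensive cluster is not automatically a win.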
Key Takeaways
GPU acceleration can provide significant performance improvements and cost savings for large-scale Spark workloads
Integrating GPU-accelerated Spark requires some upfront effort to identify and resolve bottlenecks, but can lead to transformative results
FINRA's experience demonstrates the real-world business impact of this technology, from regulatory compliance to fraud detection
As GPU hardware and software continue to evolve, GPU acceleration is becoming a strategic part of enterprise big data architectures