AWS re:Invent 2025 - FINRA: Accelerate Massive Data Processing with NVIDIA on AWS EMR (AIM279)

Introduction to GPU Acceleration for Apache Spark

  • Apache Spark is a widely used data processing framework in enterprises and other organizations
  • Data volumes have been growing exponentially in recent years, especially for AI and machine learning workloads
  • To handle this data growth, enterprises are turning to GPU acceleration to speed up Spark workloads

The GPU Acceleration Stack

  • The solution is a plugin that can be easily integrated into existing Spark workflows
  • The plugin sits on top of the NVIDIA GPU layer, which can run in the cloud or on premises
  • GPU acceleration can provide significant performance improvements and cost savings
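As a rough illustration of what "plugging in" can look like, the sketch below assembles the Spark properties commonly used to enable NVIDIA's RAPIDS Accelerator for Apache Spark. The talk summary does not name the plugin or list any settings, so the choice of plugin, the values, and the helper names `rapids_spark_conf` and `to_submit_args` are assumptions for illustration only.

```python
def rapids_spark_conf(gpus_per_executor=1, cores_per_executor=8, explain_fallbacks=True):
    """Return Spark properties that turn on GPU SQL acceleration.

    All values are illustrative assumptions; tune them for the actual cluster.
    """
    if gpus_per_executor < 1 or cores_per_executor < 1:
        raise ValueError("executors need at least one GPU and one core")

    conf = {
        # Loads the RAPIDS Accelerator SQL plugin into the Spark session.
        "spark.plugins": "com.nvidia.spark.SQLPlugin",
        "spark.rapids.sql.enabled": "true",
        "spark.executor.cores": str(cores_per_executor),
        "spark.executor.resource.gpu.amount": str(gpus_per_executor),
        # Each task claims a fraction of a GPU so every core can schedule a task.
        "spark.task.resource.gpu.amount": f"{gpus_per_executor / cores_per_executor:.3f}",
    }
    if explain_fallbacks:
        # Asks the plugin to log operators that stayed on the CPU.
        conf["spark.rapids.sql.explain"] = "NOT_ON_GPU"
    return conf


def to_submit_args(conf):
    """Flatten a property dict into spark-submit style arguments."""
    args = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args
```

Because the plugin is enabled purely through configuration, the existing job code does not need to change in this sketch. Turning on the fallback logging is one way to see which operators stayed on the CPU.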

Real-World Use Cases

  • Companies across various industries have publicly shared their success with this GPU-accelerated Spark technology
  • One example is fraud detection, where billions of records are processed with time series analysis, an ideal workload for GPUs
  • In that case, GPU acceleration delivered a 14x speedup and 90% cost savings compared to CPU-only processing
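The speedup and cost figures are linked by simple arithmetic. The sketch below assumes that job cost is hourly cluster price multiplied by runtime, which the talk summary does not state; under that assumption, a 14x speedup with 90% savings implies the GPU cluster cost roughly 1.4 times as much per hour as the CPU cluster. The function names are hypothetical.

```python
def implied_price_ratio(speedup, cost_savings):
    """GPU-to-CPU hourly price ratio implied by a speedup and a savings fraction.

    Assumes cost = hourly price * runtime, so
    gpu_cost / cpu_cost = price_ratio / speedup = 1 - cost_savings.
    """
    if speedup <= 0 or not 0 <= cost_savings < 1:
        raise ValueError("speedup must be > 0 and cost_savings in [0, 1)")
    return speedup * (1.0 - cost_savings)


def cost_savings(speedup, price_ratio):
    """Fractional savings when a job runs `speedup` times faster on a cluster
    that costs `price_ratio` times as much per hour."""
    if speedup <= 0 or price_ratio <= 0:
        raise ValueError("speedup and price_ratio must be > 0")
    return 1.0 - price_ratio / speedup


# Figures quoted for the fraud detection example: 14x faster, 90% cheaper.
ratio = implied_price_ratio(14, 0.90)  # approximately 1.4
```

The same relation shows why a speedup alone does not guarantee savings: if the GPU cluster's hourly price ratio exceeds the speedup, the job gets faster but more expensive.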

FINRA's Journey with GPU-Accelerated Spark

Background on FINRA

  • FINRA is a not-for-profit organization responsible for market integrity and investor protection
  • They operate over 1 PB of storage in the AWS cloud, processing massive datasets for regulatory compliance and fraud detection

Evaluating GPU Acceleration

  • FINRA initially used Apache Hive for their SQL queries, then transitioned to Apache Spark
  • When introduced to GPU-accelerated Spark, they ran tests on the TPC-DS 9B benchmark
  • The tests showed a 50% performance improvement and a 50% cost reduction compared to CPU-only Spark

Applying to Production Workloads

  • FINRA then applied the GPU-accelerated Spark to their production trading application workloads
  • Again, they saw roughly a 50% performance improvement and a 45% cost reduction
  • However, the initial GPU runs were not optimal, requiring collaboration with NVIDIA to identify and resolve bottlenecks

Integrating GPU Spark into the Data Pipeline

  • FINRA's data pipeline involves decompression, type conversion, and Parquet conversion of 100,000 daily CSV files
  • By transitioning this pipeline to use GPU-accelerated Spark, they achieved consistent runtime and cost savings
  • This required some code changes to leverage Spark DataFrames instead of the less GPU-friendly Dataset API
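A minimal PySpark-style sketch of such a conversion step is shown below. It is not FINRA's code: the column names, paths, and helper names are hypothetical, and `spark` is assumed to be an existing SparkSession. PySpark has no typed Dataset API, so the sketch only illustrates the DataFrame style, where the schema and conversions are declared to Spark rather than hidden inside per-record functions that a plugin cannot inspect.

```python
def schema_ddl(columns):
    """Build a Spark DDL schema string from (name, type) pairs."""
    return ", ".join(f"{name} {dtype}" for name, dtype in columns)


def csv_to_parquet(spark, src_path, dst_path, columns):
    """Read CSV files with an explicit schema and write them out as Parquet.

    `spark` is an existing SparkSession. Spark's CSV reader handles
    gzip-compressed inputs (for example *.csv.gz) transparently.
    """
    df = (
        spark.read
        .schema(schema_ddl(columns))   # type conversion happens on read
        .option("header", "true")
        .csv(src_path)
    )
    df.write.mode("overwrite").parquet(dst_path)
    return df


# Hypothetical columns for illustration; the real file layout is not in the talk summary.
TRADE_COLUMNS = [
    ("trade_id", "BIGINT"),
    ("symbol", "STRING"),
    ("price", "DOUBLE"),
    ("executed_at", "TIMESTAMP"),
]
```

A call such as `csv_to_parquet(spark, "s3://example-bucket/in/", "s3://example-bucket/out/", TRADE_COLUMNS)` would run the whole step through DataFrame operations; the bucket paths here are placeholders.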

Lessons Learned and the Path Forward

  • Not every workload will see immediate benefits from GPU acceleration - identifying bottlenecks is crucial
  • FINRA has established a process to validate CPU vs GPU performance for their workloads
  • While CPU remains the default, GPU acceleration is now a strategic part of FINRA's big data technology stack for the future
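The validation step can be pictured as a simple comparison of a CPU run and a GPU run of the same workload. The sketch below is an assumption about what such a check might look like, not FINRA's actual process; the `RunResult` type, the `choose_engine` function, and the 10% savings threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    """Measured outcome of one run of a workload (illustrative fields)."""
    runtime_hours: float
    hourly_cost: float

    @property
    def total_cost(self):
        return self.runtime_hours * self.hourly_cost


def choose_engine(cpu, gpu, min_cost_savings=0.10):
    """Return "gpu" only if the GPU run is both faster and cheaper by at least
    `min_cost_savings`; otherwise keep the CPU default."""
    if cpu.total_cost <= 0:
        raise ValueError("CPU run must have a positive cost")
    savings = 1.0 - gpu.total_cost / cpu.total_cost
    faster = gpu.runtime_hours < cpu.runtime_hours
    return "gpu" if faster and savings >= min_cost_savings else "cpu"


# Made-up measurements for two workloads.
cpu_run = RunResult(runtime_hours=2.0, hourly_cost=10.0)
good_gpu_run = RunResult(runtime_hours=1.0, hourly_cost=11.0)
poor_gpu_run = RunResult(runtime_hours=1.9, hourly_cost=14.0)
```

With these made-up numbers, the first GPU run would be selected and the second would not, which mirrors the point that CPU stays the default unless a workload demonstrably benefits.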

Key Takeaways

  • GPU acceleration can provide significant performance improvements and cost savings for large-scale Spark workloads
  • Integrating GPU-accelerated Spark requires some upfront effort to identify and resolve bottlenecks, but can lead to transformative results
  • FINRA's experience demonstrates the real-world business impact of this technology, from regulatory compliance to fraud detection
  • As GPU hardware and software continue to evolve, GPU acceleration is becoming a strategic part of enterprise big data architectures
