Accelerate Apache Spark up to 5 times on AWS with RAPIDS (ANT208)
Accelerating Batch Data Processing on GPUs with the RAPIDS Accelerator for Apache Spark
Overview
Challenges of big data processing:
Exponential growth of data from various sources (internet, consumer devices, IoT)
Need for high-quality data at low latency
Conventional approaches to handling data growth, each with a trade-off:
Horizontal scaling of compute resources
Downsampling data
Accepting longer processing times
Proposed solution: Accelerate batch data processing using NVIDIA GPUs and the RAPIDS Accelerator for Apache Spark
Apache Spark 3 and the RAPIDS Accelerator
Apache Spark 3 features that enable GPU processing:
Resource-aware scheduling
Columnar data processing
Plugin architecture
RAPIDS Accelerator for Apache Spark:
Plugin that works with different Apache Spark distributions (EMR, Databricks, etc.)
Seamless integration with existing user workflows (SQL, DataFrames)
Transparent GPU acceleration, with fallback to CPU for unsupported operations
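As a rough illustration of how the plugin is switched on without touching job code, the sketch below builds the Spark settings that load the RAPIDS Accelerator and request GPUs through resource-aware scheduling. The helper names (rapids_spark_conf, to_submit_args) and the default values are assumptions made for this example and do not come from the talk; the plugin JAR must also be on the cluster classpath, which is not shown here.

```python
def rapids_spark_conf(gpus_per_executor=1, tasks_per_gpu=4):
    """Return Spark settings that load the RAPIDS plugin and request GPUs.

    The defaults are illustrative assumptions, not tuned recommendations.
    """
    return {
        # Load the accelerator through Spark 3's plugin architecture.
        "spark.plugins": "com.nvidia.spark.SQLPlugin",
        # Enable GPU execution for supported SQL/DataFrame operations.
        "spark.rapids.sql.enabled": "true",
        # Log the operations that stay on the CPU (the fallback path).
        "spark.rapids.sql.explain": "NOT_ON_GPU",
        # Resource-aware scheduling: GPUs per executor and GPU share per task.
        "spark.executor.resource.gpu.amount": str(gpus_per_executor),
        "spark.task.resource.gpu.amount": str(1 / tasks_per_gpu),
    }


def to_submit_args(conf):
    """Flatten a settings dict into spark-submit style --conf arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    print(" ".join(to_submit_args(rapids_spark_conf())))
```

Because everything is passed as configuration, the existing SQL or DataFrame code stays unchanged, and setting spark.rapids.sql.enabled to false returns the job to CPU-only execution.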
Performance and Cost Savings
Benchmarks:
NDS (NVIDIA Decision Support benchmark) on Amazon EC2 instances:
Up to 9x faster performance on GPU, with 4.8x average speedup and 72% cost savings
NDS on Amazon EMR:
Up to 3.5x faster performance on GPU, with 1.3x average speedup and 14% cost savings
Transaction Fraud Use Case:
14x speedup and 87% cost savings when running on GPU cluster
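The speedup and cost-savings figures above are linked by simple arithmetic if one assumes that job cost equals the cluster's hourly price multiplied by runtime. Under that assumption (which is ours, not stated in the talk), the sketch below shows how a GPU cluster that costs more per hour can still be cheaper per job, and what hourly price ratio each reported pair of numbers would imply. The function names are hypothetical, and pairing an average speedup with a savings figure is only approximate.

```python
def cost_savings(speedup, price_ratio):
    """Fractional cost saving of the GPU run relative to the CPU run.

    speedup     = cpu_runtime / gpu_runtime
    price_ratio = gpu_cluster_hourly_price / cpu_cluster_hourly_price
    Assumes cost = hourly price * runtime.
    """
    return 1.0 - price_ratio / speedup


def implied_price_ratio(speedup, savings):
    """Hourly price ratio implied by a reported speedup and saving."""
    return speedup * (1.0 - savings)


if __name__ == "__main__":
    reported = {
        "NDS on EC2 (average)": (4.8, 0.72),
        "NDS on EMR (average)": (1.3, 0.14),
        "Transaction fraud": (14.0, 0.87),
    }
    for name, (speedup, savings) in reported.items():
        ratio = implied_price_ratio(speedup, savings)
        print(f"{name}: implied GPU/CPU hourly price ratio {ratio:.2f}")
```

The takeaway is that a GPU run saves money whenever the speedup exceeds the hourly price ratio, which is why small speedups yield only modest savings.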
Identifying Workloads for GPU Acceleration
Spark RAPIDS User Tools:
Open-source tool to analyze Spark event logs and recommend workloads for GPU acceleration
Provides recommendations on cluster configuration and job tuning
Getting Started
Documentation:
NVIDIA RAPIDS Accelerator for Apache Spark documentation
Amazon EMR documentation on using the RAPIDS Accelerator
Spark RAPIDS User Tools:
Available on the Python Package Index (PyPI)
Notebook for running the tool on EMR
Conclusion
Exponential data growth can be addressed by accelerating batch data processing on GPUs using the RAPIDS Accelerator for Apache Spark
Significant performance improvements and cost savings can be achieved by identifying the right workloads for GPU acceleration
The open-source Spark RAPIDS User Tools can help in this process by analyzing Spark event logs and providing recommendations