Accelerate Apache Spark up to 5 times on AWS with RAPIDS (ANT208)

Accelerating Batch Data Processing on GPUs with Rapids Accelerator for Apache Spark

Overview

  • Challenges of big data processing:
    • Exponential growth of data from various sources (internet, consumer devices, IoT)
    • Need for high-quality data at low latency
  • Approaches to handle data growth:
    • Horizontal scaling of compute resources
    • Downsampling data
    • Accepting longer processing times
  • Proposed solution: Accelerate batch data processing using Nvidia GPUs and the Rapids Accelerator for Apache Spark

Apache Spark 3 and Rapids Accelerator

  • Apache Spark 3 features that enable GPU processing:
    • Resource-aware scheduling
    • Columnar data processing
    • Plugin architecture
  • Rapids Accelerator for Apache Spark:
    • Plug-in that works with different Apache Spark distributions (EMR, Databricks, etc.)
    • Seamless integration with existing user workflows (SQL, DataFrames)
    • Transparent GPU acceleration, with fallback to CPU for unsupported operations
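
As a rough illustration of how the plugin attaches to an existing job, the sketch below assembles the Spark properties typically used to load the Rapids Accelerator and to request GPUs through Spark 3 resource-aware scheduling. The helper names (rapids_spark_conf, to_submit_args) and the default values are hypothetical and are not taken from the session. The property keys are standard Spark and Rapids Accelerator settings, but the right values depend on the cluster and plugin version.

```python
def rapids_spark_conf(gpus_per_executor=1, tasks_per_gpu=4):
    """Return Spark properties that load the Rapids Accelerator plugin.

    The defaults are illustrative assumptions, not tuned recommendations.
    """
    return {
        # Plugin architecture: load the accelerator as a Spark plugin.
        "spark.plugins": "com.nvidia.spark.SQLPlugin",
        # Turn GPU acceleration of SQL/DataFrame operations on.
        "spark.rapids.sql.enabled": "true",
        # Log the operations that stay on the CPU (the fallback path).
        "spark.rapids.sql.explain": "NOT_ON_GPU",
        # Resource-aware scheduling: GPUs per executor, GPU share per task.
        "spark.executor.resource.gpu.amount": str(gpus_per_executor),
        "spark.task.resource.gpu.amount": str(1 / tasks_per_gpu),
    }


def to_submit_args(conf):
    """Flatten a property dict into spark-submit style '--conf key=value' arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    print(" ".join(to_submit_args(rapids_spark_conf())))
```

The existing SQL or DataFrame code is left unchanged. Only the job configuration differs, which is what makes the acceleration transparent to the user workflow.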

Performance and Cost Savings

  • Benchmarks:
    • NDS (Nvidia Decision Support Benchmark) on AWS EC2 instances:
      • Up to 9x faster performance on GPU, with 4.8x average speedup and 72% cost savings
    • NDS on AWS EMR:
      • Up to 3.5x faster performance on GPU, with 1.3x average speedup and 14% cost savings
  • Transaction Fraud Use Case:
    • 14x speedup and 87% cost savings when running on GPU cluster
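
Speedup and cost savings are linked by simple arithmetic. The sketch below assumes that job cost equals the hourly cluster price multiplied by the runtime, and that each quoted savings figure corresponds to the speedup listed next to it. Neither assumption is stated in the session summary. Under those assumptions, savings = 1 - price_ratio / speedup, so the hourly price ratio of the GPU cluster to the CPU cluster can be backed out. The function name implied_price_ratio is hypothetical.

```python
def implied_price_ratio(speedup, cost_savings):
    """GPU-to-CPU hourly price ratio implied by a speedup and a savings figure.

    Assumes cost = hourly price * runtime, which gives
    cost_savings = 1 - price_ratio / speedup.
    """
    return (1.0 - cost_savings) * speedup


# (speedup, fractional cost savings) as quoted in the benchmarks above.
results = {
    "NDS on EC2 (average)": (4.8, 0.72),
    "NDS on EMR (average)": (1.3, 0.14),
    "Transaction fraud": (14.0, 0.87),
}

for name, (speedup, savings) in results.items():
    ratio = implied_price_ratio(speedup, savings)
    print(f"{name}: GPU cluster costs about {ratio:.2f}x per hour")
```

Read this way, a GPU cluster can cost more per hour and still be cheaper per job, as long as the speedup exceeds the price ratio.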

Identifying Workloads for GPU Acceleration

  • Spark Rapids User Tools:
    • Open-source tool to analyze Spark event logs and recommend workloads for GPU acceleration
    • Provides recommendations on cluster configuration and job tuning
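
To give a sense of the raw input such a tool works from, the sketch below reads a Spark event log (newline-delimited JSON) and totals executor run time per stage. It is not how the Spark Rapids User Tools compute their recommendations. The function name executor_time_by_stage and the sample records are hypothetical. The field names follow the Spark event log format as commonly written, but should be checked against the logs from your Spark version.

```python
import json
from collections import defaultdict


def executor_time_by_stage(lines):
    """Sum 'Executor Run Time' (milliseconds) per stage from event-log lines."""
    totals = defaultdict(int)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        # Only task-end events carry the per-task metrics used here.
        if event.get("Event") != "SparkListenerTaskEnd":
            continue
        metrics = event.get("Task Metrics") or {}
        totals[event.get("Stage ID")] += metrics.get("Executor Run Time", 0)
    return dict(totals)


# Hypothetical, heavily trimmed event-log records for illustration only.
sample = [
    '{"Event": "SparkListenerJobStart", "Job ID": 0}',
    '{"Event": "SparkListenerTaskEnd", "Stage ID": 0, "Task Metrics": {"Executor Run Time": 1200}}',
    '{"Event": "SparkListenerTaskEnd", "Stage ID": 0, "Task Metrics": {"Executor Run Time": 800}}',
    '{"Event": "SparkListenerTaskEnd", "Stage ID": 1, "Task Metrics": {"Executor Run Time": 500}}',
]

if __name__ == "__main__":
    for stage_id, millis in sorted(executor_time_by_stage(sample).items()):
        print(f"stage {stage_id}: {millis} ms of executor run time")
```

Stages that dominate executor time are natural places to start when judging whether a job is worth moving to GPU.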

Getting Started

  • Documentation:
    • Nvidia Rapids Accelerator for Apache Spark documentation
    • AWS EMR documentation on using the Rapids Accelerator
  • Spark Rapids User Tools:
    • Available on the Python Package Index (PyPI)
    • Notebook for running the tool on EMR

Conclusion

  • Exponential data growth can be addressed by accelerating batch data processing on GPUs using the Rapids Accelerator for Apache Spark
  • Significant performance improvements and cost savings can be achieved by identifying the right workloads for GPU acceleration
  • The open-source Spark Rapids User Tools can help in this process by analyzing Spark event logs and providing recommendations
