Push Down Paradigm and Spark-pandas API in Databricks

Healthcare

Therapeutics company reduced data processing time and operational cost by 80% with AntStack

Push Down Paradigm and Spark-pandas API in Databricks Artwork

About Push Down Paradigm and Spark-pandas API in Databricks

A leading Therapeutics company is committed to developing novel therapies with the potential to transform the lives of people with debilitating disorders of the brain. They are pursuing new pathways to improve brain health and run depression, neurology, and neuropsychiatry franchise programs that aim to change how brain disorders are perceived and treated. Their mission is to make medicines that matter so people can get better, sooner. They aim to transform the practice of neuroscience research and rethink how central nervous system (CNS) disorders are understood and treated. Their mission is to pioneer solutions to deliver life-changing brain health medicines, so every person can thrive.

The Challenge

The client had a significantly large volume of data already available in the Snowflake data warehouse. The expectations were to collaborate with analysts and build a data pipeline to create analytics-ready tables in the snowflake. Considering the volume of the data, the client wanted to reduce network overhead and operational cost while performing data transformations. Also, optimise the SQL code provided by the analysts as much as possible to perform the ETL.

Our Goals

With AntStack they wanted to achieve the following objectives:

  • Collaborate with analysts in timely delivery of analytics ready data.
  • Optimise the SQL code provided by analysts as much as possible.
  • Reduce network and operational overhead of the data pipeline.
  • Optimise development process and running cost of the data pipeline.

Pulling the data into Databricks to do the necessary transformations and then storing the results in Snowflake would increase the process’s latency. Rather than retrieving the data into Databricks, we pushed the query down to Snowflake to do the necessary modifications. Snowflake provides a library that makes pushing down simple and effective. The client’s analysts transformed sophisticated and critical business logic into a Snowflake stored procedure for an analytics-ready table. Using a large Snowflake warehouse, the stored procedure took over 4 hours to implement the business logic on massive amounts of data. This execution time impedes the timely distribution of processed data. The AntStack team investigated further and discovered that the non-distributed nature of the execution of the stored process led the cause. So pushing the query down to Snowflake for that specific table did not work. Instead of pushing the query down to Snowflake, we loaded the data into Databricks, implemented the business rules as a Python function, and used Pandas APIs for Spark to apply the logic as an aggregate function. This method dramatically lowered the computation time from four hours to fifty minutes.

In this data pipeline, we just needed to push SQL queries from Databricks to the Snowflake warehouse. We discovered similar trends in the notebooks we had. This resulted in the establishment of a template notebook into which SQL queries were loaded based on the target table and context. This technique cut development time and simplified the process of adding more analytics-ready tables to the pipeline. blog-Page-transparent.png

The above image demonstrates the ETL architecture the AntStack team employed for the data pipeline.

Impact

50% reduction time by using templatized dbc notebook

Templatized Databricks notebooks reduced development time by 50% and made incremental changes easier.

50% reduction time by using templatized dbc notebook

80% reduction in execution time by using spark-pandas API

Pandas API for Spark helped to reduce the execution time by 80% for a particular analytics-ready table.

80% reduction in execution time by using spark-pandas API

Results that make a statement

The client was able to create a reliable and effective data pipeline that provided analytics-ready data on schedule and decreased operational overhead and costs thanks to the commitment and experience of the AntStack team.

50%
50%
Reduction in Development Time
80%
80%
Reduction in Execution Time
  • Pushing SQL queries down to Snowflake instead of bringing data into Databricks minimized network overhead.
  • Pandas API for Spark helped to reduce the execution time by 80% for a particular analytics-ready table.
  • Templatized Databricks notebooks reduced development time by 50% and made incremental changes easier.
  • The data pipeline’s execution time was reduced, allowing a smaller Snowflake warehouse to serve the ETL, thereby reducing operational costs.
Cookies Icon

These cookies are used to collect information about how you interact with this website and allow us to remember you. We use this information in order to improve and customize your browsing experience and for analytics and metrics about our visitors on this website.

If you decline, your information won’t be tracked when you visit this website. A single cookie will be used in your browser to remember your preference not to be tracked.

Build With Us