A leading pharma company cuts data processing time by 80% with Databricks

About Push Down Paradigm and Spark-pandas API in Databricks

A leading Therapeutics company is committed to developing novel therapies with the potential to transform the lives of people with debilitating disorders of the brain. They are pursuing new pathways to improve brain health and run depression, neurology, and neuropsychiatry franchise programs that aim to change how brain disorders are perceived and treated. Their mission is to make medicines that matter so people can get better, sooner. They aim to transform the practice of neuroscience research and rethink how central nervous system (CNS) disorders are understood and treated. Their mission is to pioneer solutions to deliver life-changing brain health medicines, so every person can thrive.

The Challenge

The client had a significantly large volume of data already available in the Snowflake data warehouse. The expectations were to collaborate with analysts and build a data pipeline to create analytics-ready tables in the Snowflake. Considering the volume of the data, the client wanted to reduce network overhead and operational cost while performing data transformations. Also, optimise the SQL code provided by the analysts as much as possible to perform the ETL.

Our Goals

With AntStack they wanted to achieve the following objectives:

Collaborate with analysts in the timely delivery of analytics-ready data
Optimise the SQL code provided by analysts as much as possible
Reduce network and operational overhead of the data pipeline
Optimise the development process and running cost of the data pipeline

Pulling the data into Databricks to do the necessary transformations and then storing the results in Snowflake would increase the process’s latency. So, rather than retrieving the data into Databricks, we pushed the query down to Snowflake to do the necessary modifications. Snowflake provides a library that makes pushing down simple and effective. The client’s analysts transformed sophisticated and critical business logic into a Snowflake stored procedure for an analytics-ready table. Using a large Snowflake warehouse, the stored procedure took over 4 hours to implement the business logic on massive amounts of data. This execution time impedes the timely distribution of processed data. The AntStack team investigated further and discovered that the non-distributed nature of the execution of the stored process led the cause. So pushing the query down to Snowflake for that specific table did not work. Instead we loaded the data into Databricks, implemented the business rules as a Python function, and used Pandas APIs for Spark to apply the logic as an aggregate function. This method dramatically lowered the computation time from four hours to fifty minutes.

In this data pipeline, we just needed to push SQL queries from Databricks to the Snowflake warehouse. We discovered similar trends in the notebooks we had. This resulted in establishing a template notebook into which SQL queries were loaded based on the target table and context. This technique cut development time and simplified the process of adding more analytics-ready tables to the pipeline.

This image demonstrates the ETL architecture the AntStack team employed for the data pipeline.

Our Impact

50% reduction time by using templatized DBC notebook

Templatized Databricks notebooks reduced development time by 50% and made incremental changes easier.

80% reduction in execution time by using Spark-pandas API

Pandas API for Spark helped to reduce the execution time by 80% for a particular analytics-ready table.

Results That Speak Volume

The client was able to create a reliable and effective data pipeline that provided analytics-ready data on schedule and decreased operational overhead and costs thanks to the commitment and experience of the AntStack team.

50%

Reduction in Development Time

80%

Reduction in Execution Time

Pushing SQL queries down to Snowflake instead of bringing data into Databricks minimized network overhead
The data pipeline’s execution time was reduced, allowing a smaller Snowflake warehouse to serve the ETL, thereby reducing operational costs

A leading pharma company cuts data processing time by 80% with Databricks

About Push Down Paradigm and Spark-pandas API in Databricks

The Challenge

Our Goals

Our Impact

50% reduction time by using templatized DBC notebook

80% reduction in execution time by using Spark-pandas API

Headquarters

Delivery Centre

A leading pharma company cuts data processing time by 80% with Databricks

About Push Down Paradigm and Spark-pandas API in Databricks

The Challenge

Our Goals

Our Impact

50% reduction time by using templatized DBC notebook

80% reduction in execution time by using Spark-pandas API

This website stores cookies on your computer.