A leading therapeutics company revolutionising data engineering and transformation with Databricks

About Data Engineering On Databricks

The client, a leading therapeutics company, is committed to developing novel therapies with the potential to transform the lives of people with debilitating disorders of the brain. They are pursuing new pathways to improve brain health, and they run depression, neurology, and neuropsychiatry franchise programs that aim to change how brain disorders are perceived and treated.

Their mission is to make medicines that matter so people can get better, sooner. They aim to transform the practice of neuroscience research, rethink how central nervous system (CNS) disorders are understood and treated, and pioneer life-changing brain health medicines so that every person can thrive.

The Challenge

The company focuses on drug and compound research and development, with the goal of delivering life-changing brain health medicines and therapies. They use translational data to drive efficiency in drug development, to explore the impact of their proprietary compounds, and to understand the potential of those compounds in treating disorders of the brain. Working with AntStack, they have designed a portal that offers accurate, balanced, and current scientific information to support medical professionals.

Our Goals

Their initial expectations involved building pipelines to load data from various sources, performing the required transformations on that data, and making it available to business users and analysts, all on the Databricks platform. The data to be loaded ranged from research and development data on tests, drugs, and compounds to commercial and customer data collected both internally and from external vendors. The data sources varied from SFTP servers and external RDBMS databases to text and CSV files made available via AWS S3. The requirement also included the eventual development of a framework and process that could be adapted to any use case and could handle any type and scale of data.

Technology Advancement with New Serverless Platform

The therapeutics company was facing challenges with the existing system and was delighted with the following outcomes:

Speed and Reliability Goals

While they use Healthchecks.io for selected use cases and follow a general practice of maintaining checklists for quality and sanity checks, they wanted to improve their primary measure of speed and reliability: the ability to apply the process and framework described above to non-generic use cases and ad hoc requirements. AntStack delivered resilient solutions to hurdles in data loading and transformation within a relatively short period while maintaining data quality.

Simple and Effective Cron Job Monitoring

They were looking for a notification system for nightly backups, weekly reports, cron jobs, and scheduled tasks. Most of these jobs were not running on time. AntStack addressed this with a process flow in which a user generates a unique ping URL for each background job and then updates the job to send an HTTP request to that URL every time it runs. When the job does not ping Healthchecks.io on time, the tool alerts the user. This simple yet effective solution helped them deliver on time.
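The ping flow described above can be sketched in a few lines of Python. This is a minimal illustration, not the client's actual job wrapper: the `PING_URL` value is a placeholder, and the names `run_monitored` and `http_get` are hypothetical. It assumes the Healthchecks.io convention that a plain request to the ping URL signals success and a request to the same URL with `/fail` appended signals failure.

```python
import urllib.request
from typing import Callable

# Placeholder only: a real ping URL is generated per check in Healthchecks.io.
PING_URL = "https://hc-ping.com/your-uuid-here"


def http_get(url: str, timeout: float = 10.0) -> None:
    """Send a GET request, ignoring network errors so that a monitoring
    outage never causes the job itself to fail."""
    try:
        urllib.request.urlopen(url, timeout=timeout).close()
    except OSError:
        pass


def run_monitored(job: Callable[[], None], ping_url: str,
                  send: Callable[[str], None] = http_get) -> bool:
    """Run `job`, then ping `ping_url` on success or `ping_url + "/fail"`
    if the job raised. Returns True when the job succeeded."""
    try:
        job()
    except Exception:
        send(ping_url + "/fail")
        return False
    send(ping_url)
    return True
```

A scheduled task would call `run_monitored(nightly_backup, PING_URL)`, where `nightly_backup` stands for any callable. Passing `send` as a parameter keeps the wrapper testable without network access.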

Seamless Integration with External Storage Services

The therapeutics company did not want to go serverless in its implementation. Instead, they wanted Databricks to take care of spinning up, managing, and orchestrating the compute clusters used for the ETL process, as well as the SQL endpoints used for querying and analytics. AntStack used Databricks to help them integrate seamlessly with external storage services and to take advantage of its job orchestration and workflow capabilities, Git integration for source control, and preconfigured Spark environment. The platform also offers rich notebooks with support for multiple languages, including SQL, Python, and Shell, which made the trade-off of managed clusters over serverless computing worth it.

Technological Loading and Transformation

They were loading data and applying the required transformations, which ranged from adding new columns and performing aggregations to joining and combining data from various bronze tables into one or more target tables across the refined (silver) and trusted (gold) layers. The implementation used the Databricks platform to load data with the Spark methods available through PySpark and Spark SQL. Source data is first read into a raw (bronze) layer, using whichever of the read methods supported by Spark suits each source.
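The bronze, silver, and gold steps described above can be illustrated with a small sketch. The actual implementation used PySpark and Spark SQL. Plain Python is used here only so the example runs without a cluster, and every table, column name, and threshold below is hypothetical.

```python
from collections import defaultdict


def to_silver(bronze_results, bronze_compounds):
    """Refine raw rows: drop rows with no measurement, cast the value to a
    number, add a derived column, and join in the compound name."""
    names = {c["compound_id"]: c["name"] for c in bronze_compounds}
    silver = []
    for row in bronze_results:
        if row.get("value") in (None, ""):
            continue  # skip rows that carry no measurement
        value = float(row["value"])
        silver.append({
            "compound_id": row["compound_id"],
            "compound_name": names.get(row["compound_id"], "unknown"),
            "value": value,
            "is_high": value >= 1.0,  # hypothetical threshold for the new column
        })
    return silver


def to_gold(silver):
    """Aggregate refined rows into the mean value per compound name."""
    grouped = defaultdict(list)
    for row in silver:
        grouped[row["compound_name"]].append(row["value"])
    return {name: sum(vals) / len(vals) for name, vals in grouped.items()}
```

Each function takes the previous layer as input and returns a new collection, mirroring how each layer is written to its own tables rather than modified in place.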

The cleaned and transformed data is then made available to business users and analysts, with granular permissions, via the SQL warehouses (endpoints) provided by Databricks. Databricks notebooks and workflows handle the bulk of the loading and transformation, while the AWS Glue Data Catalog acts as the Hive metastore. Other AWS services are also used extensively, such as SES for reporting.

Our Impact

The therapeutics company lacked a fast and reliable framework to load and transform large volumes of data spread across multiple sources, systems, and teams. AntStack offered a solution that set up processes and templates for various generic data loading scenarios, shortening and streamlining the time between collecting the data and being able to explore it. This approach reduced the time involved, particularly for the generic cases, and helped identify and understand the pain points that needed focus.
