Merck advances healthcare data extraction using text-to-SQL on AWS (PRO203)

Summary

Introduction

  • The presentation covers how Merck is leveraging generative AI to advance healthcare data extraction using text-to-SQL on AWS.
  • The presenters are Henry Wong (AWS Generative AI Innovation Center), Vlad Tritz (Merck), and Tess Gabmerus (AWS Generative AI Innovation Center).

Generative AI Overview

  • Generative AI is powered by foundation models: models pre-trained on large datasets, with a very large number of parameters, that are capable of producing original content.
  • Generative AI allows for faster development and quicker iteration of innovations across various industries, including healthcare and life sciences.

Healthcare Data Challenges

  • Exponential growth in healthcare data volume and complexity (structured, unstructured, multimodal).
  • Increasing data interoperability challenges due to data silos.
  • Traditional healthcare data analytics approaches are becoming inadequate, necessitating the use of new technologies.

Merck's Challenges in Data Analytics

  • Merck's real-world data ecosystem consists of many disparate data sources with unique structures and variables.
  • Developing SQL queries on these heterogeneous data sets is challenging.
  • Common data models often result in data translation losses and imperfect fidelity.
  • Citizen data scientists rely on cohort builders, which are expensive and require a learning curve.
  • Merck wants to work with pre-assembled, clinically vetted code lists for disease and treatment identification.

Text-to-SQL Solution

  • The text-to-SQL solution uses a large language model (LLM) to generate SQL queries from user instructions and database context.
  • The input to the LLM includes:
    • Instructions: Describing the desired SQL query
    • Database schema: The structure of the database
    • Sample data: Anonymized data to understand the data format
    • Few-shot examples: Reference SQL query examples
    • Optional: Column descriptions, table descriptions, and lookup tools
  • The LLM generates the SQL query along with a summary of its intended functionality, allowing for user feedback and iterative refinement (a prompt-assembly sketch follows this list).
  • The solution also includes an execution component to run the generated SQL against the database and return the results.
  • A feedback system is in place to capture user input and further improve the solution's performance.
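
For context, below is a minimal sketch of how the prompt inputs listed above (instructions, database schema, sample data, few-shot examples) could be assembled and sent to an LLM. The use of the Amazon Bedrock Converse API, the Claude model ID, and the prompt wording are illustrative assumptions; the presentation does not disclose Merck's actual prompt template, model choice, or orchestration code.

```python
# Minimal prompt-assembly sketch for text-to-SQL (illustrative only).
# Assumptions: Amazon Bedrock Converse API via boto3 and a Claude model ID;
# the real template, context fields, and model are implementation details.
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")


def build_prompt(question, schema_ddl, sample_rows, few_shot_examples):
    """Combine the user instruction with database context and reference queries."""
    examples = "\n\n".join(
        f"Question: {q}\nSQL: {sql}" for q, sql in few_shot_examples
    )
    return (
        "You are a SQL assistant for a healthcare real-world-data warehouse.\n\n"
        f"Database schema:\n{schema_ddl}\n\n"
        f"Anonymized sample rows:\n{sample_rows}\n\n"
        f"Reference examples:\n{examples}\n\n"
        f"Instruction: {question}\n"
        "Return the SQL query, then a short summary of what it does."
    )


def generate_sql(question, schema_ddl, sample_rows, few_shot_examples):
    """Call the LLM and return its SQL-plus-summary response as text."""
    response = bedrock.converse(
        modelId="anthropic.claude-3-sonnet-20240229-v1:0",  # illustrative choice
        messages=[{
            "role": "user",
            "content": [{"text": build_prompt(
                question, schema_ddl, sample_rows, few_shot_examples)}],
        }],
        inferenceConfig={"maxTokens": 1024, "temperature": 0.0},
    )
    return response["output"]["message"]["content"][0]["text"]
```

In the described workflow, the generated SQL and summary would first be shown to the user for feedback, and only then passed to the execution component that runs the query against the database.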

Performance and Future Directions

  • The solution was tested on a set of 40 questions of varying difficulty levels, achieving over 97% accuracy with a single reprompt.
  • Future plans include:
    • Expanding the solution to Merck's full real-world data ecosystem
    • Deploying the solution to all of Merck's real-world data users
    • Optimizing performance through model and prompt experimentation
    • Developing a real-world dataset of user questions to further evaluate the solution
    • Exploring more automated accuracy evaluation methods (a sketch of execution-match scoring follows this list)
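
To illustrate the automated accuracy evaluation direction noted above, the sketch below scores generated SQL by execution match: the generated query and a reference query are both run, and their result sets are compared as multisets. The SQLite connection, the evaluation-set format, and the generate_sql callable are placeholders rather than details from the presentation.

```python
# Execution-match scoring sketch (illustrative only). The SQLite connection,
# evaluation-set format, and generate_sql callable are placeholders.
import sqlite3
from collections import Counter


def execution_match(conn, generated_sql, reference_sql):
    """True if both queries return the same multiset of rows."""
    try:
        generated_rows = conn.execute(generated_sql).fetchall()
    except sqlite3.Error:
        return False  # invalid SQL counts as a miss (a candidate for reprompting)
    reference_rows = conn.execute(reference_sql).fetchall()
    return Counter(generated_rows) == Counter(reference_rows)


def evaluate(conn, eval_set, generate_sql):
    """eval_set: list of (natural-language question, reference SQL) pairs."""
    hits = sum(
        execution_match(conn, generate_sql(question), reference_sql)
        for question, reference_sql in eval_set
    )
    return hits / len(eval_set)
```

Execution match tolerates differences in SQL phrasing as long as the returned rows agree, which is why it is a common correctness proxy in text-to-SQL evaluation.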
