Merck advances healthcare data extraction using text-to-SQL on AWS (PRO203)
Summary
Introduction
The presentation covers how Merck is leveraging generative AI to advance healthcare data extraction using text-to-SQL on AWS.
The presenters are Henry Wong (AWS Generative AI Innovation Center), Vlad Tritz (Merck), and Tess Gabmerus (AWS Generative AI Innovation Center).
Generative AI Overview
Generative AI is powered by foundation models: models pre-trained on large datasets with many parameters, capable of producing original content.
Generative AI allows for faster development and quicker iteration of innovations across various industries, including healthcare and life sciences.
Healthcare Data Challenges
Exponential growth in healthcare data volume and complexity (structured, unstructured, multimodal).
Increasing data interoperability challenges due to data silos.
Traditional healthcare data analytics approaches are becoming inadequate, necessitating the use of new technologies.
Merck's Challenges in Data Analytics
Merck's real-world data ecosystem consists of many disparate data sources with unique structures and variables.
Developing SQL queries on these heterogeneous data sets is challenging.
Common data models often result in data translation losses and imperfect fidelity.
Citizen data scientists rely on cohort builders, which are expensive and have a steep learning curve.
Merck wants to work with pre-assembled, clinically vetted code lists for disease and treatment identification.
Text-to-SQL Solution
The text-to-SQL solution leverages a large language model (LLM) to generate SQL queries based on user instructions and database context.
The input to the LLM includes:
Instructions: Describing the desired SQL query
Database schema: The structure of the database
Sample data: Anonymized data to understand the data format
Few-shot examples: Reference SQL query examples
Optional: Column descriptions, table descriptions, and lookup tools
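The inputs listed above can be combined into a single prompt for the LLM. The sketch below shows one plausible way to assemble such a prompt; the function name, prompt layout, and sample schema are illustrative assumptions, not Merck's actual implementation.

```python
# Hypothetical prompt assembly for a text-to-SQL request.
# All names and the prompt layout are assumptions for illustration.

def build_prompt(instructions, schema, sample_rows, few_shot_examples,
                 column_descriptions=None):
    """Combine the prompt components into a single LLM input string."""
    parts = [
        "You are a SQL assistant. Generate a SQL query for the request below.",
        f"Request:\n{instructions}",
        f"Database schema:\n{schema}",
        f"Sample rows (anonymized):\n{sample_rows}",
    ]
    if few_shot_examples:
        examples = "\n\n".join(
            f"Question: {q}\nSQL: {sql}" for q, sql in few_shot_examples
        )
        parts.append(f"Reference examples:\n{examples}")
    if column_descriptions:  # optional context, per the list above
        parts.append(f"Column descriptions:\n{column_descriptions}")
    parts.append("Return the SQL query and a one-sentence summary of what it does.")
    return "\n\n".join(parts)

prompt = build_prompt(
    instructions="Count patients diagnosed with type 2 diabetes in 2023.",
    schema="CREATE TABLE diagnoses (patient_id INT, icd10 TEXT, dx_date DATE);",
    sample_rows="(101, 'E11.9', '2023-04-02')",
    few_shot_examples=[("How many patients are there?",
                        "SELECT COUNT(DISTINCT patient_id) FROM diagnoses;")],
)
```

The resulting string is what gets sent to the model; the optional column and table descriptions simply append more context when they are available.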
The LLM generates the SQL query and a summary of its intended functionality, allowing for user feedback and iterative refinement.
The solution also includes an execution component to run the generated SQL against the database and return the results.
A feedback system is in place to capture user input and further fine-tune the LLM's performance.
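The execution and feedback components can be sketched as follows, using an in-memory SQLite database as a stand-in for the real-world data warehouse. The table, the feedback record structure, and the error-return convention are assumptions for illustration.

```python
# Minimal sketch: run a generated query, return results or an error string,
# and log user feedback. SQLite stands in for the actual data sources.
import sqlite3

def execute_sql(conn, sql):
    """Run a generated query; return (rows, error) so a failure can be
    fed back to the LLM for refinement instead of raising."""
    try:
        rows = conn.execute(sql).fetchall()
        return rows, None
    except sqlite3.Error as exc:
        return None, str(exc)

feedback_log = []  # captured user input, used to tune future prompts

def record_feedback(question, sql, accepted, comment=""):
    feedback_log.append(
        {"question": question, "sql": sql,
         "accepted": accepted, "comment": comment}
    )

# Toy database standing in for a real-world data source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE diagnoses (patient_id INT, icd10 TEXT)")
conn.executemany("INSERT INTO diagnoses VALUES (?, ?)",
                 [(1, "E11.9"), (2, "E11.9"), (2, "I10")])

rows, err = execute_sql(
    conn,
    "SELECT COUNT(DISTINCT patient_id) FROM diagnoses WHERE icd10 = 'E11.9'")
# rows -> [(2,)], err -> None
record_feedback("How many type 2 diabetes patients?",
                "SELECT ...", accepted=True)
```

Returning the error rather than raising is what makes iterative refinement possible: the error text can be appended to the next prompt.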
Performance and Future Directions
The solution was tested on a set of 40 questions of varying difficulty levels, achieving over 97% accuracy with a single reprompt.
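The "single reprompt" idea above can be sketched as a retry loop: if the generated SQL fails to execute, the error message is fed back to the model for one corrected attempt before the answer is scored wrong. In this sketch, `generate_sql` is a stub standing in for the actual LLM call; its behavior is an assumption for demonstration only.

```python
# Hedged sketch of answering with at most one reprompt on failure.
import sqlite3

def generate_sql(question, error=None):
    # Stub for the LLM call: the first attempt is deliberately broken,
    # and the reprompt (with the error attached) returns a corrected query.
    if error is None:
        return "SELEC COUNT(*) FROM visits"   # syntax error on purpose
    return "SELECT COUNT(*) FROM visits"

def answer_with_reprompt(conn, question, max_attempts=2):
    """Try the generated query; on failure, reprompt once with the error."""
    error = None
    for _ in range(max_attempts):
        sql = generate_sql(question, error)
        try:
            return conn.execute(sql).fetchall()
        except sqlite3.Error as exc:
            error = str(exc)
    return None  # counted as incorrect in an accuracy evaluation

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (id INT)")
conn.executemany("INSERT INTO visits VALUES (?)", [(1,), (2,), (3,)])
result = answer_with_reprompt(conn, "How many visits?")
# result -> [(3,)] after one reprompt
```

An accuracy figure like the one reported would then be the fraction of the 40 test questions whose final result matches the expected answer.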
Future plans include:
Expanding the solution to Merck's full real-world data ecosystem
Deploying the solution to all of Merck's real-world data users
Optimizing performance through model and prompt experimentation
Developing a real-world dataset of user questions to further evaluate the solution
Exploring more automated accuracy evaluation methods