Here is a detailed summary of the video transcription in markdown, broken into sections for readability:
## Legacy System Migration Challenges
- Traditional data migration from legacy systems is a risky, expensive, and time-consuming operation.
- Legacy systems often run on old programming languages, and few people remain who understand the underlying processes and data.
- Mapping the old data onto a new system is difficult when the new system itself is not yet well understood.
## Leveraging Contract Data to Streamline Migration
- The project migrates data from a legacy system to a new system; the relevant data is also captured in contract PDF documents.
- Manually migrating roughly 20,000 contracts would not have met the timeline, so a new approach was needed.
- The key idea is to use the contracts as the ground truth, instead of relying on the data in the legacy system.
## Automated Data Extraction and Human Validation
- The project aims to extract up to 150 attributes from the contract documents.
- Document quality varies, since some contracts originate from the company itself while others come from third-party sources.
- The team follows a "human-in-the-loop" principle: business users can accept or decline each extracted attribute, ensuring high data quality.
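The talk does not show the team's data model, but a minimal sketch of how such a human-in-the-loop attribute record might look is below. All names (`ExtractedAttribute`, `ReviewStatus`, the individual fields) are illustrative assumptions, not the project's actual schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ReviewStatus(Enum):
    PENDING = "pending"      # awaiting business-user review
    ACCEPTED = "accepted"    # user confirmed the extracted value
    DECLINED = "declined"    # user rejected the extracted value


@dataclass
class ExtractedAttribute:
    """One of the up-to-150 attributes extracted from a contract (illustrative)."""
    name: str                      # e.g. "contract_start_date" (hypothetical attribute)
    value: str                     # value proposed by the extraction pipeline
    source: str                    # "ai" or "legacy"
    page: Optional[int] = None     # PDF page where the value was found
    reasoning: str = ""            # explanation produced by the extraction step
    status: ReviewStatus = ReviewStatus.PENDING

    def review(self, accept: bool) -> None:
        """Record the business user's accept/decline decision."""
        self.status = ReviewStatus.ACCEPTED if accept else ReviewStatus.DECLINED
```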
## Serverless Pipeline Architecture
- Documents are uploaded to an S3 bucket.
- A Step Functions workflow processes the batch of documents.
- A Step Functions distributed map processes each document in parallel, performing tasks such as OCR, data cleaning, and attribute extraction using Amazon Comprehend.
- The extracted data attributes are aggregated, prioritizing more recent data.
- The results are written back to S3, and each document's processing state is updated in an Aurora database.
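The concrete workflow definition is not shown in the talk; a hedged sketch of a Step Functions distributed map over the uploaded documents could look roughly like this. The bucket name, Lambda function name, role ARN, and concurrency limit are placeholders, not values from the project.

```python
import json

import boto3

DOCUMENT_BUCKET = "contract-documents-bucket"   # placeholder: bucket the PDFs are uploaded to
PROCESS_LAMBDA = "process-contract-document"    # placeholder: per-document OCR + extraction Lambda

definition = {
    "Comment": "Process a batch of contract PDFs with a distributed map",
    "StartAt": "ProcessDocuments",
    "States": {
        "ProcessDocuments": {
            "Type": "Map",
            # Read the batch directly from the S3 bucket the documents were uploaded to.
            "ItemReader": {
                "Resource": "arn:aws:states:::s3:listObjectsV2",
                "Parameters": {"Bucket": DOCUMENT_BUCKET, "Prefix": "incoming/"},
            },
            # Each document is handled by its own child workflow execution.
            "ItemProcessor": {
                "ProcessorConfig": {"Mode": "DISTRIBUTED", "ExecutionType": "STANDARD"},
                "StartAt": "ExtractAttributes",
                "States": {
                    "ExtractAttributes": {
                        "Type": "Task",
                        "Resource": "arn:aws:states:::lambda:invoke",
                        "Parameters": {
                            "FunctionName": PROCESS_LAMBDA,
                            "Payload.$": "$",
                        },
                        "End": True,
                    }
                },
            },
            "MaxConcurrency": 50,
            # Aggregated per-document results land back in S3 for the downstream merge step.
            "ResultWriter": {
                "Resource": "arn:aws:states:::s3:putObject",
                "Parameters": {"Bucket": DOCUMENT_BUCKET, "Prefix": "results/"},
            },
            "End": True,
        }
    },
}

sfn = boto3.client("stepfunctions")
sfn.create_state_machine(
    name="contract-migration-batch",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/contract-migration-sfn-role",  # placeholder
)
```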
## Front-end Demonstration
- For each extracted attribute, the front end shows its source (AI extraction or legacy system), the extracted value, the page where the information was found, and the reasoning behind the extraction.
- The business user can accept or decline the data attributes, allowing for human validation.
- The front-end also displays alternative values from the legacy system, enabling the user to choose the most up-to-date and accurate data.
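Purely as an illustration, the review rows the front end renders could be shaped roughly like the following. The field and attribute names are assumptions, and the `resolve` helper is a hypothetical stand-in for whatever the real backend does when a user accepts or declines a value.

```python
from typing import Optional, TypedDict


class ReviewRow(TypedDict):
    """One row as rendered in the review UI (illustrative shape, not the real API)."""
    attribute: str
    source: str                   # "ai" or "legacy"
    extracted_value: str
    page: Optional[int]           # PDF page the value was found on
    reasoning: str                # explanation produced by the extraction step
    legacy_value: Optional[str]   # alternative value from the legacy system


def resolve(row: ReviewRow, accepted: bool) -> Optional[str]:
    """Return the value to load into the new system after the user's decision.

    If the user declines the AI-extracted value, fall back to the legacy value,
    so the reviewer can still choose the more accurate of the two.
    """
    return row["extracted_value"] if accepted else row["legacy_value"]


# Hypothetical example row and decision.
demo = ReviewRow(
    attribute="notice_period_days",
    source="ai",
    extracted_value="90",
    page=12,
    reasoning="Clause 14.2 states a notice period of ninety days.",
    legacy_value="60",
)
print(resolve(demo, accepted=True))   # -> "90"
```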
## Key Takeaways
- Involve business users actively in the migration process, instead of relying on external experts.
- Combine automated data extraction with human validation to achieve high data quality.
- Establish an evaluation framework to quickly test and improve data extraction models (a minimal sketch follows this list).
- Leverage the expertise of your team to understand the old and new data structures and formats.
- Address data quality challenges, such as document classification, by incorporating business user insights.
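The evaluation framework mentioned above could be as simple as scoring extracted attributes against a small, hand-labeled set of contracts. The sketch below assumes JSON files mapping contract IDs to attribute/value pairs; the file names and layout are assumptions, not part of the talk.

```python
import json
from collections import defaultdict


def evaluate(predictions: dict, ground_truth: dict) -> dict:
    """Per-attribute accuracy of extracted values against a hand-labeled set.

    Both inputs map contract_id -> {attribute_name: value}.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for contract_id, truth in ground_truth.items():
        extracted = predictions.get(contract_id, {})
        for attribute, expected in truth.items():
            total[attribute] += 1
            if str(extracted.get(attribute, "")).strip() == str(expected).strip():
                correct[attribute] += 1
    return {attr: correct[attr] / total[attr] for attr in total}


if __name__ == "__main__":
    # Assumed file layout: JSON files mapping contract IDs to attribute dicts.
    with open("ground_truth.json") as f:
        truth = json.load(f)
    with open("extracted.json") as f:
        preds = json.load(f)
    for attribute, accuracy in sorted(evaluate(preds, truth).items()):
        print(f"{attribute}: {accuracy:.1%}")
```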