# A practitioner's guide to data for generative AI (DAT319)

A summary of the talk video.
## Key Takeaways

- Work backwards from the generative AI workflow and the RAG techniques you plan to use to determine the data sources your application requires
- Leverage your existing data sources as much as possible, not just vector data
- Automate as much as possible to save time on operational tasks
## Preparing Data for RAG Applications

### Unstructured Data

- Chunk the data into bite-sized pieces that the embedding model can process
- Techniques such as fixed-size, schematic, hierarchical, and semantic chunking can be used
- Trade-offs to consider include retrieval performance, relevancy, and cost
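As an illustration, fixed-size chunking with overlap (the simplest of the techniques above) can be sketched in a few lines; the chunk and overlap sizes are arbitrary placeholders, not values from the talk:

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list:
    """Split text into fixed-size character chunks with overlap, so the
    embedding model never sees more than chunk_size characters and
    adjacent chunks share some context across the boundary."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks
```

In practice the step would count tokens rather than characters, but the overlap idea is the same.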
### Structured and Semi-Structured Data

- Can be queried directly with natural-language queries
- May need to be transformed into a vector representation to unlock additional semantic meaning
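One common way to produce that vector representation is to serialize each row into text before passing it to an embedding model. A minimal sketch (the field names are hypothetical):

```python
def row_to_text(row: dict) -> str:
    """Serialize a database row into a flat text string so a text
    embedding model can capture its semantic meaning.
    The example field names below are illustrative only."""
    return "; ".join(f"{key}: {value}" for key, value in row.items())

# Hypothetical product record:
# row_to_text({"name": "trail shoe", "category": "footwear", "price": 89})
```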
### Index Selection

- Approximate Nearest Neighbor (ANN) indexes such as HNSW (Hierarchical Navigable Small World) are popular for vector search
- Balance factors such as recall, storage size, and query performance
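For intuition about the recall trade-off: exhaustive search scores every stored vector and is the 100%-recall baseline that ANN indexes like HNSW approximate at sub-linear cost. A self-contained sketch of that baseline:

```python
import math

def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def exact_top_k(query: list, vectors: list, k: int = 2) -> list:
    """Brute-force nearest-neighbor search: 100% recall, O(n) per query.
    ANN indexes trade a little recall for much faster lookups."""
    ranked = sorted(range(len(vectors)),
                    key=lambda i: cosine_similarity(query, vectors[i]),
                    reverse=True)
    return ranked[:k]
```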
## Building the RAG Data Pipeline

### Sources of Context

- Situational context from operational databases
- Semantic context from transformed data in vector databases
- Analytical context from data lakes and warehouses
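These three kinds of context are ultimately assembled into one grounded prompt for the model. A sketch, with made-up section labels:

```python
def build_prompt(question: str, situational: str,
                 semantic: str, analytical: str) -> str:
    """Combine the three context sources into a single grounded prompt.
    The section labels are illustrative, not from the talk."""
    return "\n\n".join([
        f"Customer state (operational DB): {situational}",
        f"Retrieved passages (vector DB): {semantic}",
        f"Aggregates (data lake/warehouse): {analytical}",
        f"Question: {question}",
    ])
```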
### Pipeline Design Principles

- Decouple data-processing stages with storage between them
- Choose technologies based on data structure and access patterns
- Leverage managed/serverless services when possible
- Use a log-centric design, landing raw data in Amazon S3
- Optimize for cost, not just raw scale
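A toy sketch of the first and log-centric principles combined: raw data lands untouched first, and an independent stage picks it up from storage. A local directory stands in for S3 here, and the file layout is invented for illustration:

```python
import json
import pathlib
import tempfile

# Durable storage decouples the stages (a temp dir stands in for S3).
root = pathlib.Path(tempfile.mkdtemp())
raw, chunked = root / "raw", root / "chunks"
raw.mkdir()
chunked.mkdir()

# Stage 1: land the raw document unmodified (the immutable "log").
(raw / "doc1.json").write_text(json.dumps({"id": 1, "body": "hello world"}))

# Stage 2: an independent processor reads raw objects and writes chunks.
for path in raw.glob("*.json"):
    doc = json.loads(path.read_text())
    for i, word in enumerate(doc["body"].split()):
        (chunked / f"{doc['id']}-{i}.json").write_text(
            json.dumps({"doc": doc["id"], "chunk": word}))
```

Because each stage only talks to storage, either side can be replaced, rerun, or scaled without touching the other.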
### Data Sharing and Governance

- Establish a data product model to share data across teams
- Implement data quality checks and lineage tracking
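A minimal data-quality gate, run before a chunk is written to the vector store, might look like the following; the required fields and size limit are assumptions for illustration, not a standard schema:

```python
def check_chunk_quality(chunk: dict) -> list:
    """Return a list of quality issues for a chunk record; an empty
    list means the chunk may proceed. Field names are illustrative."""
    issues = []
    if not chunk.get("text", "").strip():
        issues.append("empty text")
    if "source" not in chunk:
        issues.append("missing lineage: no source field")
    if len(chunk.get("text", "")) > 8000:
        issues.append("text too large for the embedding model")
    return issues
```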
## Tying It All Together

- Automate as much of the operational work as possible
- Leverage existing data sources, not just vector data
- Work backwards from the generative AI workflow and RAG techniques