## Key Takeaways
- Work backwards from the workflow and RAG techniques to determine the data sources required for your generative AI application
- Leverage your existing data sources as much as possible, not just vector data
- Automate as much as possible to save time on operational tasks
## Preparing Data for RAG Applications
### Unstructured Data
- Chunk the data into bite-sized pieces that the embedding model can process
- Techniques include fixed-size, schematic, hierarchical, and semantic chunking (a fixed-size sketch follows this list)
- Trade-offs to consider include retrieval performance, relevancy, and cost
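
As one illustration, here is a minimal sketch of fixed-size chunking with overlap, the simplest of the techniques above. The chunk size and overlap values are arbitrary placeholders; real values depend on your embedding model's context window and your relevancy/cost trade-offs.

```python
def fixed_size_chunks(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping fixed-size chunks.

    Sizes are illustrative; tune them to your embedding model's limits.
    The overlap preserves context that would otherwise be cut at boundaries.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# Example: each chunk is then passed to the embedding model separately.
doc = "Retrieval-augmented generation pairs a model with your own data. " * 50
chunks = fixed_size_chunks(doc, chunk_size=200, overlap=20)
```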
### Structured and Semi-Structured Data
- Can be queried directly using natural language (e.g., translated to SQL)
- May need to be transformed into a vector representation to unlock additional semantic meaning (see the sketch after this list)
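
A common way to unlock that semantic meaning is to serialize each row into text before embedding it. This is a sketch only: the row fields are invented, and the embedding call is a placeholder for whatever model or API you use.

```python
def row_to_text(row: dict) -> str:
    # Serialize a database row into a sentence an embedding model can read,
    # e.g. "product: trail shoe; category: footwear; price_usd: 129".
    return "; ".join(f"{key}: {value}" for key, value in row.items())

row = {"product": "trail shoe", "category": "footwear", "price_usd": 129}
text = row_to_text(row)
# vector = embedding_model.embed(text)  # placeholder: call your own model here,
# then store the vector alongside the row's primary key for semantic lookup.
```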
### Index Selection
- Approximate Nearest Neighbor (ANN) indexes such as HNSW are popular for vector search
- Balance recall against index storage size and query performance (see the hnswlib sketch below)
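
For example, here is how an HNSW index might be built with the open-source hnswlib library. The `M`, `ef_construction`, and `ef` values below are common starting points, not recommendations; raising them generally improves recall at the cost of index size, build time, and query latency.

```python
import numpy as np
import hnswlib  # pip install hnswlib

dim, num_elements = 384, 10_000
vectors = np.random.rand(num_elements, dim).astype(np.float32)  # stand-in embeddings

index = hnswlib.Index(space="cosine", dim=dim)
# M controls graph connectivity (bigger index, better recall);
# ef_construction controls build-time effort (slower build, better recall).
index.init_index(max_elements=num_elements, ef_construction=200, M=16)
index.add_items(vectors, np.arange(num_elements))

index.set_ef(50)  # query-time effort: higher ef -> better recall, slower queries
labels, distances = index.knn_query(vectors[:1], k=5)
```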
## Building the RAG Data Pipeline
### Sources of Context
- Situational context from operational databases
- Semantic context from transformed data in vector databases
- Analytics context from data lakes/warehouses (a prompt-assembly sketch combining all three follows this list)
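
A sketch of how these three kinds of context might be assembled into a single prompt. Every function name and hard-coded string below is a hypothetical stand-in for your own data access layer.

```python
# Hypothetical lookups standing in for real data sources.
def get_user_profile(user_id: str) -> str:
    return f"user {user_id}: 3 open orders, prefers email"   # operational DB

def vector_search(question: str, top_k: int = 3) -> list[str]:
    return ["chunk: returns policy", "chunk: shipping times"][:top_k]  # vector DB

def get_spend_summary(user_id: str) -> str:
    return f"user {user_id}: $1,240 spend last quarter"      # lake/warehouse

def build_prompt(question: str, user_id: str) -> str:
    # Fold situational, semantic, and analytics context into one prompt.
    context = "\n".join(
        [get_user_profile(user_id), *vector_search(question), get_spend_summary(user_id)]
    )
    return f"Context:\n{context}\n\nQuestion: {question}"

print(build_prompt("Where is my order?", user_id="42"))
```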
### Pipeline Design Principles
- Decouple data processing stages with durable storage between them
- Choose technologies based on data structure and access patterns
- Leverage managed/serverless services when possible
- Use a log-centric design, landing raw data in S3 (a two-stage sketch follows this list)
- Optimize for cost, not just raw scale
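
A rough sketch of two decoupled stages with S3 in between, following the log-centric design above. The bucket names and key layout are assumptions; in practice each stage would run as its own managed job (Lambda, Glue, etc.).

```python
import json
import boto3

s3 = boto3.client("s3")
RAW_BUCKET = "example-raw-landing"        # hypothetical bucket names
CHUNK_BUCKET = "example-processed-chunks"

def land_raw(doc_id: str, payload: dict) -> None:
    # Stage 1: append-only landing of raw data -- the "log" in log-centric.
    # Downstream stages only read from S3, so the stages stay decoupled.
    s3.put_object(Bucket=RAW_BUCKET, Key=f"raw/{doc_id}.json", Body=json.dumps(payload))

def chunk_stage(doc_id: str) -> None:
    # Stage 2: a separate job reads the raw object, chunks it, and writes
    # the result for the embedding stage to pick up on its own schedule.
    obj = s3.get_object(Bucket=RAW_BUCKET, Key=f"raw/{doc_id}.json")
    text = json.loads(obj["Body"].read())["text"]
    chunks = [text[i:i + 500] for i in range(0, len(text), 500)]
    s3.put_object(Bucket=CHUNK_BUCKET, Key=f"chunks/{doc_id}.json", Body=json.dumps(chunks))
```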
## Data Sharing and Governance
- Establish a data product model to share data across teams
- Implement data quality checks and lineage tracking (a minimal check is sketched below)
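
As a minimal illustration, a batch-level quality check that could gate vectors before they are published as a data product. The record fields ("id", "vector", "source") are invented for the example; the "source" field doubles as a simple lineage pointer back to the raw object.

```python
def check_vector_batch(records: list[dict], dim: int = 384) -> list[str]:
    # Collect human-readable errors instead of failing on the first record.
    errors = []
    for record in records:
        rid = record.get("id", "<no id>")
        if "id" not in record:
            errors.append("record missing id")
        if len(record.get("vector", [])) != dim:
            errors.append(f"{rid}: vector is not {dim}-dimensional")
        if not record.get("source"):
            errors.append(f"{rid}: missing lineage source")
    return errors

batch = [{"id": "doc-1#0", "vector": [0.0] * 384, "source": "s3://raw/doc-1.json"}]
assert check_vector_batch(batch) == []  # publish only when the batch is clean
```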
## Tying it All Together
- Focus on automating operational tasks as much as possible
- Leverage existing data sources, not just vector data
- Work backwards from the generative AI workflow and RAG techniques