A practitioner’s guide to data for generative AI (DAT319)

Key Takeaways

  • Work backwards from the workflow and RAG techniques to determine the data sources required for your generative AI application
  • Leverage your existing data sources as much as possible, not just vector data
  • Automate as much as possible to save time on operational tasks

Preparing Data for RAG Applications

Unstructured Data

  • Need to chunk the data into bite-sized pieces that can be processed by the embedding model
  • Techniques like fixed chunking, schematic chunking, hierarchical chunking, and semantic chunking can be used
  • Trade-offs to consider include performance, relevancy, and cost
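
As an illustration of the simplest of these techniques, here is a minimal sketch of fixed-size chunking with overlap. The function name `chunk_fixed`, the word-based splitting, and the default sizes are assumptions made for this example, not something the talk specifies; a real pipeline would usually count tokens for the chosen embedding model rather than words.

```python
def chunk_fixed(text: str, chunk_size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into chunks of at most `chunk_size` words.

    Neighboring chunks share `overlap` words so that a sentence cut at a
    boundary still appears whole in at least one chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window has reached the end of the text,
        # so no trailing chunk made only of overlap words is emitted.
        if start + chunk_size >= len(words):
            break
    return chunks
```

Smaller chunks and larger overlap tend to raise the number of embeddings to compute and store, which is one way the cost and relevancy trade-off shows up in practice.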

Structured and Semi-Structured Data

  • Can be queried directly using natural language queries
  • May need to transform into a vector representation to unlock additional semantic meaning
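
One common way to prepare a structured record for an embedding model is to serialize it into a short text passage first. The sketch below assumes that approach; the helper `row_to_text`, the table name, and the output layout are hypothetical and chosen only for illustration.

```python
def row_to_text(row: dict, table: str) -> str:
    """Flatten one structured record into a text passage for embedding.

    Fields whose value is None are skipped so that missing data does not
    add noise to the text that gets embedded.
    """
    fields = "; ".join(
        f"{key}: {value}" for key, value in row.items() if value is not None
    )
    return f"{table} record. {fields}"
```

The resulting string would then be passed to whichever embedding model the application uses, while the original row stays in the source database for exact lookups.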

Index Selection

  • Approximate Nearest Neighbor (ANN) indexes like HNSW are popular for vector search
  • Need to balance factors like recall, storage size, and performance
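
Recall for an ANN index is usually measured against an exact, brute-force search over the same vectors. The sketch below shows that measurement only; it does not implement HNSW, and the function names and the use of cosine similarity are assumptions for this example.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def exact_top_k(query: list[float], vectors: dict[str, list[float]], k: int) -> list[str]:
    """Brute-force nearest neighbors: score every vector, keep the best k."""
    ranked = sorted(vectors.items(), key=lambda kv: cosine(query, kv[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]


def recall_at_k(approx_ids: list[str], exact_ids: list[str]) -> float:
    """Fraction of the true top-k neighbors that the approximate index returned."""
    if not exact_ids:
        return 1.0
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)
```

Running a sample of queries through both the ANN index and `exact_top_k`, then averaging `recall_at_k`, gives a number to weigh against the index's storage size and query latency.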

Building the RAG Data Pipeline

Sources of Context

  • Situational context from operational databases
  • Semantic context from transformed data in vector databases
  • Analytics context from data lakes/warehouses
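
To make the three sources concrete, here is a minimal sketch of how retrieved context might be assembled into a single prompt. The function `build_prompt` and the section labels are hypothetical; in a real application each list would be filled by a query against the corresponding store.

```python
def build_prompt(
    question: str,
    situational: list[str],
    semantic: list[str],
    analytics: list[str],
) -> str:
    """Combine context from the three source types into one prompt string."""
    sections = [
        ("Situational context", situational),  # e.g. rows from an operational database
        ("Semantic context", semantic),        # e.g. chunks returned by vector search
        ("Analytics context", analytics),      # e.g. aggregates from a data lake or warehouse
    ]
    parts = []
    for title, items in sections:
        if items:  # leave out sections with nothing retrieved
            parts.append(title + ":\n" + "\n".join(f"- {item}" for item in items))
    parts.append("Question: " + question)
    return "\n\n".join(parts)
```

Keeping each source in its own labeled section makes it easier to see which store contributed to an answer when debugging retrieval quality.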

Pipeline Design Principles

  1. Decouple data processing stages with storage
  2. Choose technologies based on data structure and access patterns
  3. Leverage managed/serverless services when possible
  4. Use a log-centric design, storing raw data in S3
  5. Optimize for cost, not just raw scale
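
Principles 1 and 4 can be sketched as two stages that communicate only through storage. In this example a local directory stands in for S3, and the stage names, file names, and record fields (`id`, `text`) are assumptions made for illustration.

```python
import json
from pathlib import Path


def ingest(records: list[dict], raw_dir: Path) -> Path:
    """Stage 1: append raw records, unmodified, to an append-only log."""
    raw_dir.mkdir(parents=True, exist_ok=True)
    raw_path = raw_dir / "events.jsonl"
    with raw_path.open("a", encoding="utf-8") as log:
        for record in records:
            log.write(json.dumps(record) + "\n")
    return raw_path


def transform(raw_path: Path, out_dir: Path) -> Path:
    """Stage 2: read the raw log and write cleaned documents elsewhere.

    The raw log is never modified, so this stage can be rerun or replaced
    without touching ingestion.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / "documents.jsonl"
    with raw_path.open(encoding="utf-8") as src, out_path.open("w", encoding="utf-8") as dst:
        for line in src:
            record = json.loads(line)
            doc = {"id": record["id"], "text": record["text"].strip().lower()}
            dst.write(json.dumps(doc) + "\n")
    return out_path
```

Because the raw data is retained, a change to the chunking or embedding approach only requires rerunning the downstream stage against the stored log.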

Data Sharing and Governance

  • Establish a data product model to share data across teams
  • Implement data quality checks and lineage tracking
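
A data quality check can be as simple as validating records before they are published to other teams. The sketch below covers quality checks only, not lineage tracking; the function `check_quality`, the rules it applies, and the `id` field are assumptions chosen for this example.

```python
def check_quality(records: list[dict], required_fields: list[str]) -> list[str]:
    """Return a list of human-readable issues found in `records`.

    Two rules are checked: every required field must be present and
    non-empty, and the `id` field must be unique across records.
    """
    issues = []
    seen_ids = set()
    for index, record in enumerate(records):
        for field in required_fields:
            if record.get(field) in (None, ""):
                issues.append(f"record {index}: missing {field}")
        record_id = record.get("id")
        if record_id is not None:
            if record_id in seen_ids:
                issues.append(f"record {index}: duplicate id {record_id}")
            seen_ids.add(record_id)
    return issues
```

An empty result means the batch passed; a pipeline could refuse to publish a data product while the list is non-empty.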

Tying it All Together

  • Focus on automating as many operational tasks as possible
  • Leverage existing data sources, not just vector data
  • Work backwards from the generative AI workflow and RAG techniques
