Build a data foundation to fuel generative AI (STG201)

Reusing Existing Foundation Models

Retrieval-Augmented Generation (RAG)

  • RAG involves using your organizational data to customize the output of an existing foundation model without retraining it.
  • The workflow includes:
    1. Creating a library of information from your organizational data in an Amazon S3 bucket or other data sources.
    2. Retrieving the relevant entries from this library and including them in the request sent to the foundation model so that its output is grounded in your data, a process called "prompting."
  • AWS provides tools like Amazon Bedrock to build RAG workflows.
    • Example: Pinterest used RAG to convert text-based questions into SQL queries, improving the productivity of its data engineers by 40%.
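
The two workflow steps above can be sketched in a few lines of Python. This is a minimal illustration and not how Amazon Bedrock or Pinterest implement RAG: the `Document` class, the word-overlap scoring, and the prompt wording are all assumptions made for the example, and a real system would retrieve with embeddings and send the prompt to a model.

```python
from dataclasses import dataclass


@dataclass
class Document:
    """One entry in the (hypothetical) library of organizational data."""
    title: str
    text: str


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of alphanumeric words."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)
    return set(cleaned.split())


def retrieve(question: str, library: list[Document], k: int = 2) -> list[Document]:
    """Return up to k documents that share the most words with the question."""
    question_words = tokenize(question)
    scored = [(len(question_words & tokenize(doc.text)), doc) for doc in library]
    # Drop documents with no overlap, then rank by overlap (highest first).
    scored = [(score, doc) for score, doc in scored if score > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:k]]


def build_prompt(question: str, library: list[Document], k: int = 2) -> str:
    """Assemble a prompt that places the retrieved context before the question."""
    context = "\n".join(
        f"[{doc.title}] {doc.text}" for doc in retrieve(question, library, k)
    )
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
    )
```

The foundation model itself is unchanged; only the prompt carries the organizational data, which is why no retraining is needed.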

Fine-Tuning

  • Fine-tuning involves adapting an existing foundation model to perform specialized tasks not covered in the pre-training process.
    • Compared with training a model from scratch, this reduces training time, improves accuracy on the target task, and allows for customization.
  • Customers use their domain-specific enterprise data to adapt the model for their specific needs.
    • Example: Booking.com fine-tuned the Llama 2 model to create an intent detection model with 7 billion parameters, which they used in their AI trip planner.
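
As a rough sketch of what domain-specific fine-tuning data can look like, the snippet below turns labeled utterances into instruction-response records, one JSON object per line. The field names (`instruction`, `input`, `response`), the instruction text, and the intent labels are hypothetical; they are not taken from Booking.com, and each training framework defines its own expected schema.

```python
import json


def to_instruction_record(utterance: str, intent: str) -> dict:
    """Wrap one labeled utterance in a (hypothetical) instruction-response schema."""
    return {
        "instruction": "Classify the traveler's intent.",
        "input": utterance.strip(),
        "response": intent,
    }


def to_jsonl(examples: list[tuple[str, str]]) -> str:
    """Serialize (utterance, intent) pairs as JSON Lines, one record per line."""
    return "\n".join(
        json.dumps(to_instruction_record(utterance, intent))
        for utterance, intent in examples
    )
```

For example, `to_jsonl([("Find me a hotel in Lisbon", "search_hotels")])` yields a single line that can be appended to a training file.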

Continued Pre-Training and Training New Models

  • Continued pre-training involves picking up where the model provider left off and using your enterprise data to extend the specialized and generalized knowledge of the model.
  • Training a new model from scratch is required if the desired model does not yet exist.
    • These approaches are resource-intensive and require specialized skills, but can yield superior model accuracy and relevance.
  • AWS provides tools like Amazon SageMaker and GPU instances, as well as specialized storage offerings like Amazon FSx for Lustre, to support training new foundation models.
    • Example: LG AI Research used Amazon SageMaker and Amazon FSx for Lustre to create their 300 billion parameter multimodal model, EXAONE, within one year.

Data Considerations

Data Discovery

  • Identify data sources, including internal data, public datasets, and licensed content.
  • Consider various data formats, such as structured data (e.g., Parquet) and unstructured data (e.g., text, images, audio).
  • Load data into an Amazon S3 data lake in batches or in real time using tools like the AWS CLI, AWS DataSync, and Amazon Kinesis.

Data Preparation

  • Enrich data with relevant metadata, such as titles, summaries, and categories for RAG workflows, or more extensive annotations like instruction-response formats for fine-tuning or pre-training.
  • Store metadata in Amazon S3 object tags, a dedicated metadata store, or a combination.
  • For textual data, perform tokenization and create embeddings, storing them in a vector database like Amazon OpenSearch Service for efficient similarity searches.
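
To make "similarity search" concrete, here is a minimal sketch of ranking stored embeddings by cosine similarity. The in-memory `index` dictionary and the tiny two-dimensional vectors are assumptions for illustration; a vector database performs the same comparison at scale, typically with approximate nearest-neighbor indexes, and real embeddings have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: list[float], index: dict[str, list[float]], k: int = 1) -> list[str]:
    """Return the ids of the k stored embeddings most similar to the query."""
    ranked = sorted(
        index,
        key=lambda doc_id: cosine_similarity(query, index[doc_id]),
        reverse=True,
    )
    return ranked[:k]
```

The ids returned by `top_k` would then be used to look up the original text and its metadata for the RAG prompt.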

Data Governance

  • Establish a data perimeter to ensure trusted identities are accessing trusted resources within expected networks.
  • Audit data access and model invocation through integration with AWS CloudTrail.
  • Leverage the security and compliance capabilities of Amazon S3 for your data sources.
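
One way to express part of a data perimeter is an S3 bucket policy that denies access from outside your AWS organization or outside an expected VPC endpoint. The sketch below only builds the policy document; the bucket name, organization id, and endpoint id are placeholders, and the policy is deliberately simplified. A production perimeter usually needs exceptions (for example, for AWS service principals) that are omitted here.

```python
import json


def data_perimeter_policy(bucket: str, org_id: str, vpce_id: str) -> dict:
    """Build a simplified bucket policy with two perimeter-style Deny statements."""
    resources = [f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Trusted identities: deny callers outside the organization.
                "Sid": "DenyOutsideOrganization",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": resources,
                "Condition": {"StringNotEquals": {"aws:PrincipalOrgID": org_id}},
            },
            {
                # Expected networks: deny requests not made through the endpoint.
                "Sid": "DenyOutsideExpectedNetwork",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": resources,
                "Condition": {"StringNotEquals": {"aws:SourceVpce": vpce_id}},
            },
        ],
    }


def policy_json(bucket: str, org_id: str, vpce_id: str) -> str:
    """Serialize the policy as the JSON text a bucket policy expects."""
    return json.dumps(data_perimeter_policy(bucket, org_id, vpce_id), indent=2)
```

Because both statements are explicit denies, they take precedence over any allow, so test a policy like this on a non-critical bucket before applying it broadly.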

Storage Considerations

Reusing Existing Foundation Models

  • Amazon S3 provides cost-effective storage for hot and cold data sets, high-performance data retrieval and processing, and deep integration with AI/ML workflows.
  • S3 Intelligent-Tiering can automatically optimize costs by moving objects between access tiers as their access patterns change.
  • Options for reading S3 data, such as the File mode and Fast File mode input modes in Amazon SageMaker and the Mountpoint for Amazon S3 file client, can improve compatibility with ML workflows.
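
Objects can be written directly to S3 Intelligent-Tiering, or existing data can be moved there with a lifecycle rule. The sketch below builds such a rule as a plain dictionary in the shape that boto3's `put_bucket_lifecycle_configuration` accepts. The rule id, the prefix, and the bucket name in the comment are placeholders chosen for this example.

```python
def intelligent_tiering_rule(prefix: str = "", after_days: int = 0) -> dict:
    """Build a lifecycle configuration that transitions objects to Intelligent-Tiering."""
    if after_days < 0:
        raise ValueError("after_days must be zero or positive")
    return {
        "Rules": [
            {
                "ID": "move-to-intelligent-tiering",  # placeholder rule name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},  # empty prefix matches every object
                "Transitions": [
                    {"Days": after_days, "StorageClass": "INTELLIGENT_TIERING"}
                ],
            }
        ]
    }


# Applying the rule requires AWS credentials, so it is shown only as a comment:
#   import boto3
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="example-bucket",
#       LifecycleConfiguration=intelligent_tiering_rule("training-data/"),
#   )
```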

Training New Models

  • Amazon FSx for Lustre provides a highly scalable, low-latency file system for training new foundation models.
  • FSx for Lustre offers hundreds of GB/s of throughput, millions of IOPS, and sub-millisecond latencies, optimized for GPU-intensive workloads.
  • FSx for Lustre can be seamlessly integrated with an Amazon S3 data lake, allowing researchers to benefit from both a collaborative file system and the governance capabilities of S3.
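
A quick back-of-the-envelope calculation shows why aggregate throughput matters for training. The checkpoint size and throughput figures below are hypothetical inputs chosen for illustration, not measurements from the session, and the estimate ignores latency, contention, and any client-side limits.

```python
def transfer_seconds(size_gb: float, throughput_gb_per_s: float) -> float:
    """Idealized time to read or write size_gb at a sustained throughput."""
    if throughput_gb_per_s <= 0:
        raise ValueError("throughput must be positive")
    return size_gb / throughput_gb_per_s


# Hypothetical 2,000 GB checkpoint at two different sustained throughputs.
slow = transfer_seconds(2000, 2)    # 1000.0 seconds at 2 GB/s
fast = transfer_seconds(2000, 200)  # 10.0 seconds at 200 GB/s
```

Under these assumptions, every checkpoint at the lower throughput keeps the GPUs waiting for minutes rather than seconds, a cost that repeats throughout a long training run.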
