AWS re:Invent 2025 - Turn unstructured data in Amazon S3 into AI-ready assets with SageMaker Catalog

Turning Unstructured Data into AI-Ready Assets with SageMaker Catalog

Data Readiness for AI

  • A strong data foundation is a prerequisite for AI and generative AI applications
  • Data management is shifting from a traditional centralized model to a "data mesh" approach, in which data assets are distributed across the organization
  • Generative AI is undergoing a parallel shift toward multi-agent collaboration and orchestration

Data Modalities and Challenges

  • A life sciences example contrasts structured data (electronic health records, omics) with unstructured data (clinical notes, images, PDFs)
  • Challenges in processing unstructured data:
    • Governance and access control
    • Selecting optimal solutions and use cases
    • Manual processing and parameter tuning
    • Orchestrating multiple models

Building a Unified Data Platform

  • AWS services for a unified data platform:
    • SageMaker Lakehouse for structured and unstructured data storage
    • SageMaker Catalog for data governance and metadata management
    • SageMaker Unified Studio for building applications and experiences

SageMaker Catalog for Unstructured Data

  • Cataloging unstructured data assets from S3 with business context and metadata
  • Associating glossary terms and providing data quality metrics
  • Enabling secure, auditable access control and permissions
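
To make the three bullets above concrete, here is a minimal sketch of what a catalog record for an S3 prefix might carry: business context, glossary terms, and an audited access check. Everything in it is an assumption for illustration: `CatalogAsset`, `grant`, `can_read`, `search`, the bucket name, and the project names are hypothetical and are not SageMaker Catalog types or API calls.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogAsset:
    """Hypothetical catalog record for an S3 prefix (not a SageMaker Catalog type)."""
    s3_uri: str
    name: str
    description: str
    glossary_terms: set = field(default_factory=set)
    # Projects that have been granted read access to this asset.
    subscribers: set = field(default_factory=set)
    # Append-only trail of grants and access checks.
    audit_log: list = field(default_factory=list)

    def grant(self, project, approved_by):
        """Record an approved access grant for a project."""
        self.subscribers.add(project)
        self.audit_log.append(f"grant {project} approved_by={approved_by}")

    def can_read(self, project):
        """Check access and log the decision so it can be audited later."""
        allowed = project in self.subscribers
        self.audit_log.append(f"read-check {project} allowed={allowed}")
        return allowed


def search(assets, term):
    """Return assets tagged with a glossary term (case-insensitive match)."""
    wanted = term.lower()
    return [a for a in assets if wanted in {t.lower() for t in a.glossary_terms}]


# Illustrative record; the bucket and project names are made up.
notes = CatalogAsset(
    s3_uri="s3://example-bucket/clinical-notes/",
    name="Clinical notes",
    description="De-identified physician notes (PDF and text).",
    glossary_terms={"Clinical", "PHI-reviewed"},
)
notes.grant("oncology-research", approved_by="data-steward")
```

In this sketch the glossary terms drive discovery (`search`), while every grant and read check lands in `audit_log`, which is the property the "auditable" bullet is pointing at.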

Building Generative AI Applications

  • Leveraging SageMaker Unified Studio to:
    • Create knowledge bases from cataloged data
    • Apply guardrails and governance controls
    • Deploy conversational AI applications
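
The flow in the list above (knowledge base, then guardrails, then a conversational front end) can be sketched without any AWS dependency. The example below is a toy under stated assumptions: it swaps in keyword-overlap retrieval for a real vector store, and a simple deny list for a managed guardrail. `ToyKnowledgeBase`, `answer`, `BLOCKED_TOPICS`, and the S3 URIs are hypothetical names, not part of SageMaker Unified Studio or Amazon Bedrock.

```python
import re


def tokenize(text):
    """Lowercase a string and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class ToyKnowledgeBase:
    """Keyword-overlap retrieval over text chunks; a stand-in for a vector store."""

    def __init__(self):
        self.chunks = []  # list of (source, chunk_text, chunk_tokens)

    def add_document(self, source, text, chunk_size=40):
        """Split a document into fixed-size word chunks and index each one."""
        words = text.split()
        for i in range(0, len(words), chunk_size):
            chunk = " ".join(words[i:i + chunk_size])
            self.chunks.append((source, chunk, tokenize(chunk)))

    def retrieve(self, question, top_k=2):
        """Return up to top_k (source, chunk) pairs sharing tokens with the question."""
        q = tokenize(question)
        scored = [(len(q & tokens), source, chunk)
                  for source, chunk, tokens in self.chunks]
        scored = [s for s in scored if s[0] > 0]
        scored.sort(key=lambda s: s[0], reverse=True)
        return [(source, chunk) for _, source, chunk in scored[:top_k]]


# Hypothetical deny list standing in for a managed guardrail policy.
BLOCKED_TOPICS = {"diagnosis", "prescribe"}


def answer(kb, question):
    """Apply the guardrail first, then return retrieved context with its sources."""
    if tokenize(question) & BLOCKED_TOPICS:
        return {"blocked": True, "context": []}
    return {"blocked": False, "context": kb.retrieve(question)}


kb = ToyKnowledgeBase()
kb.add_document("s3://example-bucket/protocols/trial-a.txt",
                "Trial A enrolls adults and measures biomarker response at week twelve.")
kb.add_document("s3://example-bucket/protocols/trial-b.txt",
                "Trial B studies imaging endpoints in pediatric patients.")

result = answer(kb, "Which trial measures biomarker response?")
```

Returning the source URI alongside each chunk keeps answers traceable to the cataloged asset they came from, which is what ties the application layer back to governance.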

Real-World Example: Bayer's Data Modernization Journey

  • Challenges with data silos, lack of trust, and inability to scale
  • Adopting a data mesh architecture with SageMaker Unified Studio as the central governance control plane
  • Automating biomarker data ETL using a Bedrock-powered agent
  • Benefits:
    • Accelerated time to harmonize data for clinical trials
    • Improved efficiency of R&D decision-making
    • Laying a foundation for precision medicine
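
The summary does not describe how the biomarker ETL agent works, so the following is not Bayer's pipeline. It is a minimal sketch of one deterministic harmonization step that such an agent might propose or invoke: renaming site-specific column names to a single canonical schema and reporting what could not be mapped. The alias table, the function name `harmonize`, and the sample records are all made up for illustration.

```python
# Hypothetical mapping from source-specific column names to one canonical schema.
COLUMN_ALIASES = {
    "patient_id": {"patient_id", "subject", "subj_id"},
    "biomarker": {"biomarker", "marker", "analyte"},
    "value": {"value", "result", "measurement"},
}


def harmonize(record):
    """Rename known aliases to canonical names.

    Returns (harmonized, unmapped, missing):
      harmonized: fields renamed to canonical names
      unmapped:   source fields with no known alias, kept for human review
      missing:    canonical fields the source record did not provide
    """
    out, unmapped = {}, {}
    for key, val in record.items():
        norm = key.strip().lower()
        for canonical, aliases in COLUMN_ALIASES.items():
            if norm in aliases:
                out[canonical] = val
                break
        else:
            # No alias matched: keep the field aside rather than dropping it.
            unmapped[key] = val
    missing = sorted(set(COLUMN_ALIASES) - set(out))
    return out, unmapped, missing


# Illustrative records from two imaginary sites with different column names.
site_a = {"Subject": "P-001", "Marker": "CRP", "Result": 4.2}
site_b = {"subj_id": "P-002", "analyte": "IL-6", "batch": "7"}
```

Surfacing `unmapped` and `missing` explicitly, instead of silently dropping or defaulting fields, is a deliberate choice here: it leaves a reviewable trail for the cases an automated step cannot resolve.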

Key Takeaways

  • Importance of unifying governance and access control for structured and unstructured data
  • Leveraging SageMaker services to build a scalable, governed data platform for AI
  • Automating data processing and model deployment with generative AI agents
  • Driving real-world business impact by modernizing data infrastructure and unlocking the value of unstructured data
