AWS re:Invent 2025 - Data Processing Architectures for Building AI Solutions (ANT328)

Data Processing Architectures for Building AI Solutions

Role of Data in AI

  • 89% of companies are prioritizing generative AI initiatives, but over 50% feel their data foundation is not ready for AI
  • AI systems like an HR onboarding agent require access to structured, unstructured, and real-time data with proper security and governance

Challenges for AI-Ready Data Foundations

  1. Blurring lines between data engineering, ML engineering, and AI engineering roles
  2. Agility to make existing data platforms "AI-ready" without completely replacing them
  3. Improving efficiency and productivity of the data processing layer

Foundational Pillars of a Modern Data Foundation

  1. Onboarding structured, semi-structured, and unstructured data
  2. Data cleansing, enrichment, and transformation with AI-driven acceleration
  3. Unified metadata layer for technical and business context
  4. Enabling analytics and AI use cases on the data platform

Reference Architecture for AI-Ready Data Foundation

  • Lakehouse storage on Amazon S3, S3 Tables, or Amazon Redshift
  • Batch, streaming, and zero-copy data ingestion mechanisms
  • Unified data catalog with Amazon SageMaker Catalog
  • Governance capabilities like data quality, lineage, and sharing
  • Analytics and AI use cases leveraging services like Amazon QuickSight, AWS Glue, Amazon EMR, and Amazon Bedrock
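The governance bullet above mentions data quality checks. As a rough illustration of the kind of rules such a layer evaluates (here in plain Python, not the actual AWS Glue Data Quality API; all names and thresholds are hypothetical):

```python
# Minimal, illustrative data-quality rules -- a stand-in for what a managed
# service like AWS Glue Data Quality evaluates at scale. All column names
# and thresholds here are hypothetical.

def check_completeness(rows, column, max_null_rate=0.05):
    """Pass if the fraction of missing values in `column` stays under the threshold."""
    nulls = sum(1 for r in rows if r.get(column) is None)
    return nulls / len(rows) <= max_null_rate

def check_uniqueness(rows, column):
    """Pass if `column` contains no duplicate values (e.g. a primary key)."""
    values = [r[column] for r in rows]
    return len(values) == len(set(values))

rows = [
    {"customer_id": 1, "state": "WA"},
    {"customer_id": 2, "state": None},
    {"customer_id": 3, "state": "CA"},
]

results = {
    "state_completeness": check_completeness(rows, "state", max_null_rate=0.5),
    "customer_id_unique": check_uniqueness(rows, "customer_id"),
}
```

In a real pipeline these rules would run as part of ingestion, with failures feeding the lineage and observability tooling mentioned above.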

Unlocking Enterprise Data for AI Agents

  1. Exposing data subsets via APIs, vector stores, or data warehouses
  2. Retrieval Augmented Generation (RAG) to augment AI models with real-time data
    • Improves accuracy, reduces hallucination, enables domain adaptation
    • Architecture uses Amazon Bedrock, Amazon OpenSearch/S3 Vectors
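The retrieval step of RAG can be sketched in a few lines. This toy version uses hand-made 3-d embeddings and in-memory cosine similarity; a real deployment would call an embedding model and a vector store such as Amazon OpenSearch or S3 Vectors, as the talk describes:

```python
import math

# Toy RAG retrieval: rank documents by cosine similarity to the query
# embedding and prepend the best match to the prompt. The embeddings and
# documents below are made up for illustration.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

corpus = {
    "PTO policy: employees accrue 1.5 days per month.": [0.9, 0.1, 0.0],
    "Expense reports are due within 30 days.":          [0.1, 0.9, 0.0],
}

# Pretend embedding of the question "How much PTO do I get?"
query_embedding = [0.95, 0.05, 0.0]

best_doc = max(corpus, key=lambda doc: cosine(query_embedding, corpus[doc]))
prompt = f"Context: {best_doc}\n\nQuestion: How much PTO do I get?"
```

Grounding the prompt in retrieved context is what improves accuracy and reduces hallucination: the model answers from the supplied document rather than from its training data alone.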

Model Context Protocol (MCP) Servers

  • Open standard for connecting AI assistants to external tools and data sources in real time
  • Integrated MCP servers for AWS Glue, Amazon EMR, and Amazon Athena
  • Reduces integration complexity, provides AI-driven insights, and simplifies observability
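MCP is built on JSON-RPC message exchange. The sketch below shows the shape of a tool call against a toy server; the tool name (`glue_list_tables`) and catalog contents are hypothetical, and the real AWS Data Processing MCP server exposes its own tool set over a full protocol implementation:

```python
import json

# Minimal sketch of an MCP-style JSON-RPC tool call. The tool name and
# in-memory "catalog" are invented for illustration only.

CATALOG = {"sales_db": ["orders", "customers"]}

def handle_request(raw):
    """Dispatch a single JSON-RPC request to a registered tool."""
    req = json.loads(raw)
    if req["method"] == "tools/call" and req["params"]["name"] == "glue_list_tables":
        db = req["params"]["arguments"]["database"]
        result = CATALOG.get(db, [])
    else:
        result = None
    return json.dumps({"jsonrpc": "2.0", "id": req["id"], "result": result})

request = json.dumps({
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {"name": "glue_list_tables", "arguments": {"database": "sales_db"}},
})
response = json.loads(handle_request(request))
```

Because the assistant only needs to speak this one protocol, each new data source costs one MCP server rather than one bespoke integration, which is where the reduction in integration complexity comes from.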

Improving Data Pipeline Productivity

  1. Auto-generating code and SQL statements using Amazon Q Developer
  2. Building visual ETL pipelines with AI-driven prompt-based automation

Demo 1: Data Processing MCP Server

  • Onboarded diabetes dataset into AWS Glue Data Catalog
  • Configured and activated Data Processing MCP server in SageMaker Studio
  • Used Amazon Q Developer to generate a Jupyter notebook accessing the diabetes data via MCP server

Demo 2: Auto-Generating Visual ETL Pipelines

  • Used prompts to auto-generate a visual ETL pipeline to join customer behavior and customer dimension data
  • Performed transformations like type casting and column renaming
  • Aggregated data by state to calculate total page views and purchase amount
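The pipeline the demo builds visually can be expressed as three small steps: join, cast/rename, aggregate. A plain-Python sketch with hypothetical column names (a real Glue visual ETL job would run this as Spark transforms):

```python
# Plain-Python sketch of the demo pipeline: join customer behavior with the
# customer dimension, cast types, rename a column, then aggregate by state.
# Column names and values are hypothetical.

behavior = [
    {"customer_id": "1", "page_views": "3", "purchase_amount": "20.0"},
    {"customer_id": "2", "page_views": "5", "purchase_amount": "10.0"},
]
customers = [
    {"customer_id": "1", "st": "WA"},
    {"customer_id": "2", "st": "WA"},
]

# Join on customer_id; cast strings to numbers; rename "st" -> "state".
dim = {c["customer_id"]: c for c in customers}
joined = [
    {
        "state": dim[b["customer_id"]]["st"],          # column rename
        "page_views": int(b["page_views"]),            # type cast
        "purchase_amount": float(b["purchase_amount"]),  # type cast
    }
    for b in behavior
]

# Aggregate totals per state.
totals = {}
for row in joined:
    agg = totals.setdefault(row["state"], {"page_views": 0, "purchase_amount": 0.0})
    agg["page_views"] += row["page_views"]
    agg["purchase_amount"] += row["purchase_amount"]
```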

Enhancing AWS Data Processing Engines for AI Readiness

  • Challenges: High-volume data ingestion, data quality, identity/access control, integrated AI/ML environment, model training and inference
  • Identity and Access Control:
    • Trusted Identity Propagation for single sign-on and fine-grained permissions
    • S3 Access Grants and Lake Formation Full Table Access for data access control
    • Lake Formation Fine-Grained Access Control for column/row/cell-level security
  • Integrated AI/ML Experience:
    • SageMaker Notebooks for quick start with Spark-powered notebooks and AI-driven code assistance
    • Spark Upgrade Agent to automate upgrading Spark applications with error handling and data quality checks
  • Performance Enhancements:
    • 4.4-4.5x faster Spark performance compared to open-source Apache Spark
    • 2x better write performance with Iceberg
    • EMR Serverless Storage Provisioning for remote shuffle storage
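The fine-grained access control described above can be pictured as server-side row and column filtering applied before results reach the caller. A toy sketch (the policy, role, and data are invented; Lake Formation enforces this centrally across engines rather than in application code):

```python
# Illustrative row- and column-level filtering of the kind Lake Formation
# fine-grained access control enforces server-side. The role, policy, and
# rows below are hypothetical.

POLICY = {
    "analyst": {
        "columns": ["state", "purchase_amount"],       # column-level security
        "row_filter": lambda r: r["state"] == "WA",    # row-level security
    },
}

ROWS = [
    {"customer_id": 1, "state": "WA", "purchase_amount": 20.0},
    {"customer_id": 2, "state": "CA", "purchase_amount": 15.0},
]

def query_as(role, rows):
    """Return only the rows and columns the role's policy permits."""
    policy = POLICY[role]
    visible = [r for r in rows if policy["row_filter"](r)]
    return [{k: r[k] for k in policy["columns"]} for r in visible]

result = query_as("analyst", ROWS)
```

The point of enforcing this in the governance layer, combined with Trusted Identity Propagation, is that the same policy applies whether the query arrives from EMR, Athena, or an AI agent.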

Key Takeaways

  • Enterprises need to revisit their data foundation to make it AI-ready, addressing people, process, and technology challenges
  • Unlocking enterprise data for AI agents through RAG and MCP servers can improve AI model accuracy and flexibility
  • Automating data pipeline development with AI-driven code generation and visual ETL can boost productivity
  • AWS is enhancing its data processing services like EMR, Glue, and Athena to address identity, access, integration, and performance needs for AI workloads
