AWS re:Invent 2025 - Data Processing Architectures for Building AI Solutions (ANT328)

Data Processing Architectures for Building AI Solutions

Role of Data in AI

  • 89% of companies are prioritizing generative AI initiatives, but over 50% feel their data foundation is not ready for AI
  • AI systems like an HR onboarding agent require access to structured, unstructured, and real-time data with proper security and governance

Challenges for AI-Ready Data Foundations

  1. Blurring lines between data engineering, ML engineering, and AI engineering roles
  2. Agility to make existing data platforms "AI-ready" without completely replacing them
  3. Improving efficiency and productivity of the data processing layer

Foundational Pillars of a Modern Data Foundation

  1. Onboarding structured, semi-structured, and unstructured data
  2. Data cleansing, enrichment, and transformation with AI-driven acceleration
  3. Unified metadata layer for technical and business context
  4. Enabling analytics and AI use cases on the data platform

Reference Architecture for AI-Ready Data Foundation

  • Lakehouse storage on Amazon S3, S3 Tables, or Amazon Redshift
  • Batch, streaming, and zero-copy data ingestion mechanisms
  • Unified data catalog with Amazon SageMaker Catalog
  • Governance capabilities like data quality, lineage, and sharing
  • Analytics and AI use cases leveraging services like Amazon QuickSight, AWS Glue, Amazon EMR, and Amazon Bedrock
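The governance bullet above mentions data quality checks. As a rough illustration of the kind of rules such a layer evaluates (here in plain Python, not the actual AWS Glue Data Quality API; all names and thresholds are hypothetical):

```python
# Minimal, illustrative data-quality rules -- a stand-in for what a managed
# service like AWS Glue Data Quality evaluates at scale. All column names
# and thresholds here are hypothetical.

def check_completeness(rows, column, max_null_rate=0.05):
    """Pass if the fraction of missing values in `column` stays under the threshold."""
    nulls = sum(1 for r in rows if r.get(column) is None)
    return nulls / len(rows) <= max_null_rate

def check_uniqueness(rows, column):
    """Pass if `column` contains no duplicate values (e.g. a primary key)."""
    values = [r[column] for r in rows]
    return len(values) == len(set(values))

rows = [
    {"customer_id": 1, "state": "WA"},
    {"customer_id": 2, "state": None},
    {"customer_id": 3, "state": "CA"},
]

results = {
    "state_completeness": check_completeness(rows, "state", max_null_rate=0.5),
    "customer_id_unique": check_uniqueness(rows, "customer_id"),
}
```

In a real pipeline these rules would run as part of ingestion, with failures feeding the lineage and observability tooling mentioned above.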

Unlocking Enterprise Data for AI Agents

  1. Exposing data subsets via APIs, vector stores, or data warehouses
  2. Retrieval Augmented Generation (RAG) to augment AI models with real-time data
    • Improves accuracy, reduces hallucination, enables domain adaptation
    • Architecture uses Amazon Bedrock, Amazon OpenSearch/S3 Vectors
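The retrieval step of RAG can be sketched in a few lines. This toy version uses hand-made 3-d embeddings and in-memory cosine similarity; a real deployment would call an embedding model and a vector store such as Amazon OpenSearch or S3 Vectors, as the talk describes:

```python
import math

# Toy RAG retrieval: rank documents by cosine similarity to the query
# embedding and prepend the best match to the prompt. The embeddings and
# documents below are made up for illustration.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

corpus = {
    "PTO policy: employees accrue 1.5 days per month.": [0.9, 0.1, 0.0],
    "Expense reports are due within 30 days.":          [0.1, 0.9, 0.0],
}

# Pretend embedding of the question "How much PTO do I get?"
query_embedding = [0.95, 0.05, 0.0]

best_doc = max(corpus, key=lambda doc: cosine(query_embedding, corpus[doc]))
prompt = f"Context: {best_doc}\n\nQuestion: How much PTO do I get?"
```

Grounding the prompt in retrieved context is what improves accuracy and reduces hallucination: the model answers from the supplied document rather than from its training data alone.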

Model Context Protocol (MCP) Servers

  • Open standard for connecting AI assistants to external tools and data sources in real time
  • Integrated MCP servers for AWS Glue, Amazon EMR, and Amazon Athena
  • Reduces integration complexity, provides AI-driven insights, and simplifies observability
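MCP is built on JSON-RPC message exchange. The sketch below shows the shape of a tool call against a toy server; the tool name (`glue_list_tables`) and catalog contents are hypothetical, and the real AWS Data Processing MCP server exposes its own tool set over a full protocol implementation:

```python
import json

# Minimal sketch of an MCP-style JSON-RPC tool call. The tool name and
# in-memory "catalog" are invented for illustration only.

CATALOG = {"sales_db": ["orders", "customers"]}

def handle_request(raw):
    """Dispatch a single JSON-RPC request to a registered tool."""
    req = json.loads(raw)
    if req["method"] == "tools/call" and req["params"]["name"] == "glue_list_tables":
        db = req["params"]["arguments"]["database"]
        result = CATALOG.get(db, [])
    else:
        result = None
    return json.dumps({"jsonrpc": "2.0", "id": req["id"], "result": result})

request = json.dumps({
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {"name": "glue_list_tables", "arguments": {"database": "sales_db"}},
})
response = json.loads(handle_request(request))
```

Because the assistant only needs to speak this one protocol, each new data source costs one MCP server rather than one bespoke integration, which is where the reduction in integration complexity comes from.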

Improving Data Pipeline Productivity

  1. Auto-generating code and SQL statements using Amazon Q Developer
  2. Building visual ETL pipelines with AI-driven prompt-based automation

Demo 1: Data Processing MCP Server

  • Onboarded diabetes dataset into AWS Glue Data Catalog
  • Configured and activated Data Processing MCP server in SageMaker Studio
  • Used Amazon Q Developer to generate a Jupyter notebook accessing the diabetes data via MCP server

Demo 2: Auto-Generating Visual ETL Pipelines

  • Used prompts to auto-generate a visual ETL pipeline to join customer behavior and customer dimension data
  • Performed transformations like type casting and column renaming
  • Aggregated data by state to calculate total page views and purchase amount
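The pipeline the demo builds visually can be expressed as three small steps: join, cast/rename, aggregate. A plain-Python sketch with hypothetical column names (a real Glue visual ETL job would run this as Spark transforms):

```python
# Plain-Python sketch of the demo pipeline: join customer behavior with the
# customer dimension, cast types, rename a column, then aggregate by state.
# Column names and values are hypothetical.

behavior = [
    {"customer_id": "1", "page_views": "3", "purchase_amount": "20.0"},
    {"customer_id": "2", "page_views": "5", "purchase_amount": "10.0"},
]
customers = [
    {"customer_id": "1", "st": "WA"},
    {"customer_id": "2", "st": "WA"},
]

# Join on customer_id; cast strings to numbers; rename "st" -> "state".
dim = {c["customer_id"]: c for c in customers}
joined = [
    {
        "state": dim[b["customer_id"]]["st"],          # column rename
        "page_views": int(b["page_views"]),            # type cast
        "purchase_amount": float(b["purchase_amount"]),  # type cast
    }
    for b in behavior
]

# Aggregate totals per state.
totals = {}
for row in joined:
    agg = totals.setdefault(row["state"], {"page_views": 0, "purchase_amount": 0.0})
    agg["page_views"] += row["page_views"]
    agg["purchase_amount"] += row["purchase_amount"]
```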

Enhancing AWS Data Processing Engines for AI Readiness

  • Challenges: High-volume data ingestion, data quality, identity/access control, integrated AI/ML environment, model training and inference
  • Identity and Access Control:
    • Trusted Identity Propagation for single sign-on and fine-grained permissions
    • S3 Access Grants and Lake Formation Full Table Access for data access control
    • Lake Formation Fine-Grained Access Control for column/row/cell-level security
  • Integrated AI/ML Experience:
    • SageMaker Notebooks for quick start with Spark-powered notebooks and AI-driven code assistance
    • Spark Upgrade Agent to automate upgrading Spark applications with error handling and data quality checks
  • Performance Enhancements:
    • 4.4-4.5x faster Spark performance compared to open-source Apache Spark
    • 2x better write performance with Iceberg
    • EMR Serverless Storage Provisioning for remote shuffle storage
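The fine-grained access control described above can be pictured as server-side row and column filtering applied before results reach the caller. A toy sketch (the policy, role, and data are invented; Lake Formation enforces this centrally across engines rather than in application code):

```python
# Illustrative row- and column-level filtering of the kind Lake Formation
# fine-grained access control enforces server-side. The role, policy, and
# rows below are hypothetical.

POLICY = {
    "analyst": {
        "columns": ["state", "purchase_amount"],       # column-level security
        "row_filter": lambda r: r["state"] == "WA",    # row-level security
    },
}

ROWS = [
    {"customer_id": 1, "state": "WA", "purchase_amount": 20.0},
    {"customer_id": 2, "state": "CA", "purchase_amount": 15.0},
]

def query_as(role, rows):
    """Return only the rows and columns the role's policy permits."""
    policy = POLICY[role]
    visible = [r for r in rows if policy["row_filter"](r)]
    return [{k: r[k] for k in policy["columns"]} for r in visible]

result = query_as("analyst", ROWS)
```

The point of enforcing this in the governance layer, combined with Trusted Identity Propagation, is that the same policy applies whether the query arrives from EMR, Athena, or an AI agent.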

Key Takeaways

  • Enterprises need to revisit their data foundation to make it AI-ready, addressing people, process, and technology challenges
  • Unlocking enterprise data for AI agents through RAG and MCP servers can improve AI model accuracy and flexibility
  • Automating data pipeline development with AI-driven code generation and visual ETL can boost productivity
  • AWS is enhancing its data processing services like EMR, Glue, and Athena to address identity, access, integration, and performance needs for AI workloads
