Data foundation in the age of generative AI (ANT302)

Key Takeaways:

The world of data has gone through many evolutions over the past three decades, marked by defining shifts such as data warehousing, big data, NoSQL, and machine learning.

Data has been the driving force behind these technologies, and generative AI (GenAI) is now the latest development reshaping data engineering.

AWS is scaling and evolving its data foundation capabilities to meet the demands of building GenAI applications.

What is a Data Foundation?

A data foundation is a behind-the-scenes organizational strategy that centers around the ingestion, integration, processing, transformation, and governance of an organization's data.

It is intended to serve the needs of employees, partners, and customers who work with the organization's data.

The key goals of a data foundation are to enable data-driven decision-making and provide a rich customer experience.

The benefits of a data foundation include improved data quality, trust, and monetization, as well as better interoperability, reusability, and data governance.

How Data Foundations Change in the Age of GenAI

GenAI introduces the need for additional data sources, primarily unstructured data, which in turn requires metadata discovery and management.

The way a GenAI application is built shapes the data processing phases, such as feature engineering, inference, and vector data management.

Vector data management involves tokenizing domain data, generating numerical vectors, and storing them in a vector database for fast semantic search and retrieval.
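
The tokenize, embed, store, and search steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the approach presented in the session: the hashed bag-of-words `embed` function stands in for a real embedding model, and `InMemoryVectorStore` is a hypothetical stand-in for a vector database. The sample sentences are invented.

```python
import hashlib
import math
import re
from collections import Counter

DIM = 1024  # toy embedding width (assumption); real models define their own size


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def embed(text: str) -> list[float]:
    """Toy embedding: hash each token into one of DIM buckets, then L2-normalize."""
    vec = [0.0] * DIM
    for token, count in Counter(tokenize(text)).items():
        bucket = int(hashlib.md5(token.encode()).hexdigest(), 16) % DIM
        vec[bucket] += count
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


class InMemoryVectorStore:
    """Hypothetical stand-in for a vector database, ranking by cosine similarity."""

    def __init__(self) -> None:
        self._rows: list[tuple[str, list[float]]] = []

    def add(self, text: str) -> None:
        # Store the original text next to its vector so hits can be returned as text.
        self._rows.append((text, embed(text)))

    def search(self, query: str, k: int = 1) -> list[str]:
        # Vectors are unit length, so the dot product equals cosine similarity.
        q = embed(query)
        scored = [(sum(a * b for a, b in zip(q, vec)), text) for text, vec in self._rows]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for _, text in scored[:k]]


store = InMemoryVectorStore()
store.add("Vendor invoices are paid within 30 days of receipt.")
store.add("Employee travel expenses require manager approval.")
top = store.search("When are vendor invoices paid?")
```

A hashed bag-of-words vector only captures word overlap, so it would miss paraphrases that a learned embedding model would match. The shape of the pipeline is the same either way.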

User personalization and context are important for GenAI applications, which require access to customer 360 data and real-time user information.

Comprehensive data governance becomes crucial for GenAI applications, covering data sharing, privacy, quality, and cataloging.
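
As a rough illustration of how privacy, quality, and cataloging checks might sit in front of a GenAI pipeline, the sketch below masks obvious personal data, records basic quality issues, and emits a catalog record. Everything here is assumed for illustration: `CatalogEntry`, `redact`, and `govern` are hypothetical names, and the two regular expressions catch only simple e-mail and SSN-shaped strings, far short of production-grade PII detection.

```python
import re
from dataclasses import dataclass, field

# Deliberately simple patterns (assumption): real PII detection needs much more.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
US_SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


@dataclass
class CatalogEntry:
    """Hypothetical catalog record describing one governed document."""
    doc_id: str
    owner: str
    contains_pii: bool
    quality_issues: list[str] = field(default_factory=list)


def redact(text: str) -> tuple[str, bool]:
    """Mask e-mail addresses and SSN-shaped numbers; report whether anything was masked."""
    masked, n_email = EMAIL.subn("[EMAIL]", text)
    masked, n_ssn = US_SSN.subn("[SSN]", masked)
    return masked, (n_email + n_ssn) > 0


def govern(doc_id: str, owner: str, text: str) -> tuple[str, CatalogEntry]:
    """Apply privacy and quality checks; return the safe text and its catalog entry."""
    safe_text, had_pii = redact(text)
    issues = []
    if not safe_text.strip():
        issues.append("empty document")
    if not owner:
        issues.append("missing owner")
    return safe_text, CatalogEntry(doc_id, owner, had_pii, issues)


safe, entry = govern("doc-1", "finance-ops", "Contact jane.doe@example.com about invoice 42.")
```

Only `safe` would be passed on to embedding or prompting, while `entry` records who owns the document and whether it originally held personal data.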

Real-World Example: Amazon Finance

Amazon Finance Operations is responsible for vendor payments, customer payments, and financial transactions at a massive scale.

To address data silos and enable a single source of truth, Amazon Finance implemented a data mesh strategy on AWS.

The data mesh approach decentralizes data management, with data producers responsible for data quality and data consumers able to easily access and use the data.

Amazon Finance used AWS data integration capabilities such as Amazon Redshift data sharing and AWS Lake Formation to share data securely without duplicating it.
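
To make the idea of sharing without duplication concrete, the sketch below generates the Amazon Redshift SQL a producer might run to publish a datashare and a consumer might run to mount it. It covers only the Redshift side, not Lake Formation, and it does not reflect Amazon Finance's actual setup: the share, schema, table, and database names are hypothetical, and the namespace GUIDs are placeholders.

```python
import re
import uuid

_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def ident(name: str) -> str:
    """Reject anything that is not a plain SQL identifier before interpolating it."""
    if not _IDENT.fullmatch(name):
        raise ValueError(f"unsafe SQL identifier: {name!r}")
    return name


def namespace(guid: str) -> str:
    """Redshift namespaces are GUIDs; uuid.UUID raises ValueError on malformed input."""
    return str(uuid.UUID(guid))


def producer_statements(share: str, schema: str, tables: list[str], consumer_ns: str) -> list[str]:
    """SQL for the producer: create the share, add objects, grant it to a consumer."""
    share, schema = ident(share), ident(schema)
    stmts = [
        f"CREATE DATASHARE {share};",
        f"ALTER DATASHARE {share} ADD SCHEMA {schema};",
    ]
    stmts += [f"ALTER DATASHARE {share} ADD TABLE {schema}.{ident(t)};" for t in tables]
    stmts.append(f"GRANT USAGE ON DATASHARE {share} TO NAMESPACE '{namespace(consumer_ns)}';")
    return stmts


def consumer_statement(local_db: str, share: str, producer_ns: str) -> str:
    """SQL for the consumer: expose the share as a local database, with no data copied."""
    return (
        f"CREATE DATABASE {ident(local_db)} FROM DATASHARE {ident(share)} "
        f"OF NAMESPACE '{namespace(producer_ns)}';"
    )


# Placeholder GUIDs and hypothetical object names.
producer_sql = producer_statements(
    "finance_share", "ledger", ["payments"], "00000000-0000-0000-0000-000000000001"
)
consumer_sql = consumer_statement(
    "finance_shared", "finance_share", "00000000-0000-0000-0000-000000000002"
)
```

The consumer queries the shared tables in place, which is what lets a data mesh keep a single source of truth while each producer retains ownership of its data.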

With a strong data foundation in place, Amazon Finance was able to quickly enhance their data mesh with generative AI features, such as:

Using vector embeddings and large language models to understand business context from policy documents.
Combining the business context with financial data to provide analysts with targeted problem-solving recommendations.
Deploying a GenAI chatbot to improve the productivity of analysts by over 80% in responding to customer queries.
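
The pattern in the list above, retrieving business context from policy documents and combining it with financial data before asking a model, can be sketched as prompt assembly. This is an assumption-laden illustration rather than Amazon Finance's implementation: retrieval is reduced to word overlap as a stand-in for vector search, the policies and invoice record are invented, and the resulting prompt would be sent to whatever large language model the application uses.

```python
import re


def words(text: str) -> set[str]:
    """Lowercased alphanumeric tokens, used for a crude overlap score."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, passages: list[str], k: int = 1) -> list[str]:
    """Stand-in for vector search: rank passages by words shared with the question."""
    q = words(question)
    ranked = sorted(passages, key=lambda p: len(q & words(p)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, passages: list[str], record: dict) -> str:
    """Combine retrieved policy context with structured financial data in one prompt."""
    context = "\n".join(f"- {p}" for p in retrieve(question, passages))
    facts = "\n".join(f"{key}: {value}" for key, value in record.items())
    return (
        "Answer using only the policy context and the transaction data.\n\n"
        f"Policy context:\n{context}\n\n"
        f"Transaction data:\n{facts}\n\n"
        f"Question: {question}"
    )


# Invented example data, for illustration only.
policies = [
    "An invoice over 10000 USD needs a second approver.",
    "Travel expenses must be filed within 60 days.",
]
record = {"invoice_id": "INV-001", "amount_usd": 12500, "approvers": 1}
prompt = build_prompt("Why is invoice INV-001 on hold?", policies, record)
```

Keeping the policy context and the transaction data in separate, labeled sections makes it easier to audit which inputs a recommendation was based on.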

The Future of AWS Data Foundations

AWS is evolving its data foundation capabilities to provide a more unified experience, including:

SageMaker Unified Studio: A single data and AI development environment for building applications, including GenAI applications.
SageMaker Data and AI Governance: Capabilities for managing data assets, models, and GenAI applications with fine-grained access controls.
SageMaker Lakehouse: A unified data management layer that brings together the strengths of data warehouses and data lakes, accessible through open APIs.

These new capabilities aim to help customers collaborate and build faster, with a comprehensive data and AI development platform on AWS.
