AWS re:Invent 2025 - Accelerate data discovery with object metadata in Amazon S3 (STG357)
Accelerating Data Discovery with Amazon S3 Metadata
Addressing Data Discovery Challenges
Data growth is exploding, with Amazon S3 now storing over 500 trillion objects
The rise of AI/ML has made unstructured data in S3 buckets valuable training data
However, finding and accessing the right data sets quickly is a major challenge for organizations
Traditional metadata solutions are complex, difficult to build and maintain, and often provide stale data
Introducing S3 Metadata
S3 Metadata automatically captures comprehensive, up-to-date metadata for the objects in a bucket
Captures both system metadata (object size, storage class, encryption status) and custom metadata (object tags and user-defined metadata)
Built on Apache Iceberg format and stored in managed S3 table buckets
S3 Metadata Tables
Journal table: Audit log of object creation, deletion, and metadata update events
Refreshes within minutes of a change, with optional automatic expiration of old records
Live inventory table: Detailed snapshot of all objects in the bucket, refreshed within about an hour of changes
Enables fast analytics and reporting without expensive list requests
Querying S3 Metadata
S3 Metadata tables use the Apache Iceberg format and are stored in managed S3 table buckets
Can be queried using a variety of analytics services and engines (Athena, Redshift, Spark, etc.)
Seamless integration with AWS Lake Formation for fine-grained access control
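As a rough illustration of what querying these tables looks like, the sketch below builds an Athena-style SQL statement against the live inventory table. The catalog name `s3tablescatalog/aws-s3`, the `b_<bucket>` namespace, the table names `journal` and `inventory`, and the column names are assumptions about how the metadata tables are exposed, and `amzn-s3-demo-bucket` is a placeholder; check them against your own account before relying on them.

```python
def metadata_table(bucket: str, table: str) -> str:
    """Return a quoted, fully qualified reference to a metadata table.

    Assumed layout: catalog "s3tablescatalog/aws-s3", namespace "b_<bucket>",
    and tables named "journal" and "inventory".
    """
    if table not in ("journal", "inventory"):
        raise ValueError(f"unknown metadata table: {table}")
    return f'"s3tablescatalog/aws-s3"."b_{bucket}"."{table}"'


def storage_summary_sql(bucket: str) -> str:
    """Build a query that totals object count and bytes per storage class."""
    return (
        "SELECT storage_class, COUNT(*) AS objects, SUM(size) AS total_bytes\n"
        f"FROM {metadata_table(bucket, 'inventory')}\n"
        "GROUP BY storage_class\n"
        "ORDER BY total_bytes DESC"
    )


if __name__ == "__main__":
    # Print the statement so it can be pasted into a query editor.
    print(storage_summary_sql("amzn-s3-demo-bucket"))
```

The function only produces the SQL text; running it is left to whichever engine you use (Athena, Redshift, Spark, and so on).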
Leveraging Metadata for Data Discovery
Request Metadata
Use journal table to audit who is accessing/modifying data
Identify and revert unwanted deletions in versioned buckets
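The audit use case above might be sketched as a journal query that lists recent delete events and who made them. The column names (`record_type`, `requester`, `source_ip_address`, `record_timestamp`, `is_delete_marker`) and the table reference are assumptions about the journal schema, and the interval syntax is Athena/Trino style.

```python
def recent_deletes_sql(bucket: str, days: int = 7) -> str:
    """Build a query listing DELETE events from the last `days` days."""
    if days <= 0:
        raise ValueError("days must be positive")
    # Assumed table naming for the journal table of `bucket`.
    table = f'"s3tablescatalog/aws-s3"."b_{bucket}"."journal"'
    return (
        "SELECT key, version_id, is_delete_marker, requester,\n"
        "       source_ip_address, record_timestamp\n"
        f"FROM {table}\n"
        "WHERE record_type = 'DELETE'\n"
        f"  AND record_timestamp > current_timestamp - INTERVAL '{int(days)}' DAY\n"
        "ORDER BY record_timestamp DESC"
    )


if __name__ == "__main__":
    print(recent_deletes_sql("amzn-s3-demo-bucket", days=3))
```

In a versioned bucket, rows where `is_delete_marker` is true point at delete markers; removing those markers is one way to bring the previous version back.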
System Metadata
Quickly find unencrypted objects to update encryption policies
Analyze data upload patterns and storage class usage
Custom Metadata
Annotate AI-generated data to separate from non-AI sources
Add contextual metadata to sensor data, scientific data, etc.
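To make the custom-metadata case concrete, the sketch below filters the inventory table on one user-defined metadata entry. It assumes a map-typed `user_metadata` column and uses the Trino/Athena function `element_at`, which returns null instead of failing when the key is absent. The key `content-source` and value `ai-generated` are hypothetical, and the exact form of the map keys (for example, whether they carry a prefix) should be checked against real rows.

```python
def sql_literal(value: str) -> str:
    """Quote a string as a SQL literal, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def custom_metadata_sql(bucket: str, meta_key: str, meta_value: str) -> str:
    """Build a query for objects whose user metadata has meta_key = meta_value."""
    # Assumed table naming for the live inventory table of `bucket`.
    table = f'"s3tablescatalog/aws-s3"."b_{bucket}"."inventory"'
    return (
        "SELECT key, size, last_modified_date\n"
        f"FROM {table}\n"
        f"WHERE element_at(user_metadata, {sql_literal(meta_key)}) "
        f"= {sql_literal(meta_value)}"
    )


if __name__ == "__main__":
    # Hypothetical annotation separating AI-generated objects from the rest.
    print(custom_metadata_sql("amzn-s3-demo-bucket", "content-source", "ai-generated"))
```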
Automating Storage Management
Use metadata queries to identify objects in Glacier storage for batch restoration
Pass query results as a manifest file to S3 Batch Operations for automated processing
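The hand-off from query results to S3 Batch Operations can be sketched as a small conversion step. This assumes the results are already available as dictionaries with `bucket`, `key`, and optionally `version_id` fields (hypothetical names), and writes a CSV manifest of `bucket,key[,version_id]` rows with URL-encoded keys, which is the CSV manifest shape Batch Operations accepts.

```python
import csv
import io
from urllib.parse import quote


def build_manifest(rows, include_versions: bool = False) -> str:
    """Turn query result rows into CSV manifest text.

    Each row is a dict with "bucket" and "key", plus "version_id" when
    include_versions is True. Object keys are URL-encoded.
    """
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for row in rows:
        # Keep "/" readable; other reserved characters are percent-encoded.
        record = [row["bucket"], quote(row["key"], safe="/")]
        if include_versions:
            if not row.get("version_id"):
                raise ValueError(f"missing version_id for key {row['key']!r}")
            record.append(row["version_id"])
        writer.writerow(record)
    return buf.getvalue()


if __name__ == "__main__":
    sample = [
        {"bucket": "amzn-s3-demo-bucket", "key": "scans/ct 001.dcm"},
        {"bucket": "amzn-s3-demo-bucket", "key": "scans/ct 002.dcm"},
    ]
    print(build_manifest(sample), end="")
```

The manifest text would then be uploaded to S3 and referenced when creating the Batch Operations job, for example a restore job for objects found in S3 Glacier storage classes.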
Natural Language Querying with MCP for S3 Tables
Use Kiro CLI with the MCP server for S3 Tables to query metadata using natural language
Automatically generates SQL queries and provides summarized insights
Real-World Impact
A medical imaging company streamlined CT scan processing with S3 Metadata
A digital content provider gained full visibility into its multi-petabyte migration to S3
Key Takeaways
S3 Metadata provides near-real-time, continuously updated metadata for fast data discovery
Metadata tables enable building smart workflows and taking action on storage insights
S3 Metadata lays the foundation for intelligent data lake management with AI agents