AWS re:Invent 2025 - Accelerate data discovery with object metadata in Amazon S3 (STG357)

Accelerating Data Discovery with Amazon S3 Metadata

Addressing Data Discovery Challenges

  • Data growth is exploding, with Amazon S3 now storing over 500 trillion objects
  • The rise of AI/ML has made unstructured data in S3 buckets valuable training data
  • However, finding and accessing the right data sets quickly is a major challenge for organizations
  • Traditional metadata solutions are complex, difficult to build and maintain, and often provide stale data

Introducing S3 Metadata

  • S3 Metadata automatically captures comprehensive, up-to-date metadata for S3 objects
  • Captures both system metadata (object size, type, encryption) and custom metadata (object tags and user-defined metadata)
  • Built on the Apache Iceberg table format and stored in managed S3 table buckets

S3 Metadata Tables

  • Journal table: Audit log of all put, delete, and modification events
    • Refreshes within minutes, with automatic expiration of old records
  • Live inventory table: Detailed snapshot of all objects in the bucket, refreshed hourly
    • Enables fast analytics and reporting without expensive list requests

Querying S3 Metadata

  • S3 Metadata tables use the Apache Iceberg format and are stored in managed S3 table buckets
  • Can be queried using a variety of analytics services and engines (Athena, Redshift, Spark, etc.)
  • Integrates with AWS Lake Formation for fine-grained access control
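
As a rough illustration of the querying path above, the sketch below builds an Athena SQL statement against a bucket's journal table and shows one way it could be submitted with boto3. The catalog name `s3tablescatalog/aws-s3`, the `b_<bucket>` namespace convention, the column names, and the `output_location` parameter are assumptions that should be checked against your own configuration.

```python
def journal_table(bucket: str) -> str:
    """Fully qualified Athena name of a bucket's journal table (assumed naming)."""
    return f'"s3tablescatalog/aws-s3"."b_{bucket}"."journal"'


def recent_events_sql(bucket: str, hours: int = 24) -> str:
    """SQL listing journal events from the last `hours` hours, newest first."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    return (
        "SELECT key, record_type, record_timestamp, requester\n"
        f"FROM {journal_table(bucket)}\n"
        f"WHERE record_timestamp > current_timestamp - interval '{hours}' hour\n"
        "ORDER BY record_timestamp DESC"
    )


def run_in_athena(sql: str, output_location: str) -> str:
    """Submit the query to Athena and return its execution ID."""
    import boto3  # imported lazily so the SQL helpers work without boto3

    athena = boto3.client("athena")
    response = athena.start_query_execution(
        QueryString=sql,
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]
```

Keeping the SQL construction separate from the submission step makes the query easy to inspect, or to paste into another Iceberg-compatible engine, before running it.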

Leveraging Metadata for Data Discovery

Request Metadata

  • Use the journal table to audit who created, modified, or deleted objects
  • Identify and revert unwanted deletions in versioned buckets
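
One possible way to act on the deletion audit described above is sketched below. It assumes journal rows have already been fetched as dictionaries with `record_type`, `is_delete_marker`, `key`, `version_id`, and `requester` fields (column names assumed from the journal schema), and that an unwanted deletion in a versioned bucket shows up as a delete marker. The function names are hypothetical.

```python
from typing import Iterable, List, Optional, Tuple


def delete_markers_to_remove(
    journal_rows: Iterable[dict], requester: Optional[str] = None
) -> List[Tuple[str, str]]:
    """Return sorted (key, version_id) pairs for delete markers in the journal.

    If `requester` is given, only deletions made by that principal are returned.
    """
    markers = set()
    for row in journal_rows:
        if row.get("record_type") != "DELETE" or not row.get("is_delete_marker"):
            continue
        if requester is not None and row.get("requester") != requester:
            continue
        markers.add((row["key"], row["version_id"]))
    return sorted(markers)


def undo_deletions(bucket: str, markers: List[Tuple[str, str]]) -> None:
    """Remove each delete marker so the previous object version becomes current."""
    import boto3  # lazy import: only needed when actually calling S3

    s3 = boto3.client("s3")
    for key, version_id in markers:
        s3.delete_object(Bucket=bucket, Key=key, VersionId=version_id)
```

Reviewing the output of `delete_markers_to_remove` before calling `undo_deletions` keeps a human check between the audit and the revert.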

System Metadata

  • Quickly find unencrypted objects to update encryption policies
  • Analyze data upload patterns and storage class usage
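
The two system-metadata use cases above could be expressed as queries against the live inventory table, roughly as follows. This is a sketch under assumptions: the table naming (`s3tablescatalog/aws-s3`, `b_<bucket>`, `inventory`), the column names, and the convention that `encryption_status` is null for unencrypted objects should all be verified against the actual table schema.

```python
def inventory_table(bucket: str) -> str:
    """Fully qualified Athena name of a bucket's inventory table (assumed naming)."""
    return f'"s3tablescatalog/aws-s3"."b_{bucket}"."inventory"'


def unencrypted_objects_sql(bucket: str) -> str:
    """SQL listing objects with no server-side encryption recorded."""
    return (
        "SELECT key, size, last_modified_date\n"
        f"FROM {inventory_table(bucket)}\n"
        "WHERE encryption_status IS NULL"
    )


def storage_class_usage_sql(bucket: str) -> str:
    """SQL summarizing object count and total bytes per storage class."""
    return (
        "SELECT storage_class, COUNT(*) AS object_count, SUM(size) AS total_bytes\n"
        f"FROM {inventory_table(bucket)}\n"
        "GROUP BY storage_class\n"
        "ORDER BY total_bytes DESC"
    )
```

Because these run against a table rather than a bucket listing, they avoid issuing list requests over the whole bucket.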

Custom Metadata

  • Annotate AI-generated data to separate it from non-AI sources
  • Add contextual metadata to sensor data, scientific data, etc.
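
A minimal sketch of how such annotations might be attached at upload time is shown below. It only assembles the keyword arguments for boto3's `put_object` call (`Metadata` for user-defined metadata, `Tagging` for object tags); the helper name and the example keys such as `source` and `model` are hypothetical.

```python
from typing import Dict
from urllib.parse import urlencode


def annotated_put_kwargs(
    bucket: str,
    key: str,
    body: bytes,
    user_metadata: Dict[str, str],
    tags: Dict[str, str],
) -> dict:
    """Build put_object arguments carrying user-defined metadata and object tags."""
    return {
        "Bucket": bucket,
        "Key": key,
        "Body": body,
        "Metadata": dict(user_metadata),  # stored as x-amz-meta-* headers
        "Tagging": urlencode(tags),       # tags are passed as a URL-encoded string
    }
```

The result could then be passed on as `boto3.client("s3").put_object(**kwargs)`, after which the annotations would be expected to surface in the metadata tables alongside the system metadata.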

Automating Storage Management

  • Use metadata queries to identify objects in Glacier storage for batch restoration
  • Pass query results as a manifest file to S3 Batch Operations for automated processing
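
The hand-off from query results to S3 Batch Operations might look like the sketch below, which turns rows (for example, inventory rows whose `storage_class` is a Glacier class) into a CSV manifest. It assumes the CSV manifest layout of `bucket,key` or `bucket,key,versionId` with URL-encoded object keys; the row field names and the helper name are hypothetical.

```python
import csv
import io
from typing import Iterable
from urllib.parse import quote


def build_manifest(rows: Iterable[dict], include_versions: bool = False) -> str:
    """Render query result rows as a CSV manifest for S3 Batch Operations.

    Each row needs "bucket" and "key"; "version_id" is required as well
    when include_versions is True.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        # Object keys are URL-encoded; "/" is kept so prefixes stay readable.
        record = [row["bucket"], quote(row["key"], safe="/")]
        if include_versions:
            record.append(row["version_id"])
        writer.writerow(record)
    return buffer.getvalue()
```

The resulting text would be uploaded to S3 and referenced when creating the Batch Operations job, for instance a restore job for the archived objects.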

Natural Language Querying with MCP for S3 Tables

  • Use the Kiro CLI with the MCP server for S3 Tables to query metadata in natural language
  • Automatically generates SQL queries and provides summarized insights

Real-World Impact

  • Medical imaging company streamlined CT scan processing with S3 Metadata
  • Digital content provider gained full visibility into multi-petabyte migration to S3

Key Takeaways

  1. S3 Metadata provides always-current metadata for fast data discovery
  2. Metadata tables enable building smart workflows and taking action on storage insights
  3. S3 Metadata lays the foundation for intelligent data lake management with AI agents
