AWS re:Invent 2025 - Best practices for building Apache Iceberg based lakehouse architectures on AWS

Building Apache Iceberg-Based Lakehouse Architectures on AWS

The Data Lake Crisis and the Rise of Apache Iceberg

The Challenges of Traditional Data Lakes

  • Data lakes faced issues like data corruption, lack of point-in-time snapshots, slow queries, and difficulties with schema evolution
  • These problems led to the emergence of the Apache Iceberg table format as a solution

Key Benefits of Apache Iceberg

  1. ACID Guarantees: Iceberg provides optimistic concurrency control, allowing multiple writers to update the same table without data corruption
  2. Time Travel and Snapshots: Iceberg's table-level snapshots enable point-in-time rollbacks and time travel queries for debugging and checkpointing
  3. Elegant Schema Evolution: Iceberg tracks columns by ID, allowing easy addition, update, or renaming of columns without rewriting data
  4. Improved Query Performance: Iceberg's optimized metadata layout enables efficient queries without expensive full-table scans
  5. Efficient Row-Level Updates: Iceberg supports equality and positional deletes, as well as deletion vectors for improved performance on CDC-heavy workloads
  6. Variant Type Support and Row Lineage: Iceberg v3 adds native support for semi-structured data and row-level lineage tracking for improved auditability
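To illustrate item 3, here is a minimal Python sketch of why tracking columns by ID makes a rename a metadata-only change. It is a toy model, not Iceberg's API: `Schema`, `add_column`, `rename_column`, and `read_row` are hypothetical names, and a "data file" is modelled as a dict keyed by field ID.

```python
from dataclasses import dataclass, field


@dataclass
class Schema:
    """Toy schema: maps stable field IDs to current column names."""
    columns: dict = field(default_factory=dict)
    next_id: int = 1

    def add_column(self, name):
        # New columns get a fresh ID that is never reused.
        field_id = self.next_id
        self.columns[field_id] = name
        self.next_id += 1
        return field_id

    def rename_column(self, old, new):
        # A rename only touches metadata; the field ID stays the same.
        for field_id, name in self.columns.items():
            if name == old:
                self.columns[field_id] = new
                return
        raise KeyError(old)


def read_row(schema, stored):
    """Project a stored row (keyed by field ID) through the current schema."""
    # Columns added after the file was written resolve to None.
    return {name: stored.get(fid) for fid, name in schema.columns.items()}


schema = Schema()
user_id = schema.add_column("user_id")
email = schema.add_column("email")

# A "data file" written under the original schema (hypothetical values).
old_file_row = {user_id: 42, email: "a@example.com"}

# Evolve the schema without touching the stored row.
schema.rename_column("email", "contact_email")
schema.add_column("country")

row = read_row(schema, old_file_row)
print(row)
```

Because the stored row is keyed by ID rather than by name, the old file remains readable after the rename, and the newly added column simply reads as missing.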

The AWS Stack for Iceberg-Powered Lakehouses

Integrated Ecosystem for Iceberg

  • Data sources are ingested into Iceberg tables using AWS services like Glue, EMR, Kinesis, and Firehose
  • The Glue Data Catalog acts as the technical metadata store, with Lake Formation providing enterprise-scale governance
  • Compute engines like Athena, Redshift, and EMR Spark natively support Iceberg tables for analytics and processing

Glue Data Catalog and Lake Formation

  • Glue Data Catalog is an Iceberg-compatible metadata store, supporting multi-catalog federation
  • Lake Formation provides fine-grained access control, data mesh capabilities, and credential vending for secure data access

S3 Table Buckets for Storage Optimization

  • S3 Table Buckets automatically handle Iceberg table maintenance tasks like compaction, snapshot retention, and orphan file cleanup
  • This reduces operational overhead and ensures optimal query performance and storage efficiency
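As a rough illustration of what compaction involves, the sketch below plans which small files to rewrite together. It is not how S3 Table Buckets implement maintenance; `plan_compaction`, the 128 MB target, and the sample file sizes are all assumptions chosen for the example.

```python
def plan_compaction(file_sizes_mb, target_mb=128):
    """Group small files into rewrite batches of at most target_mb each.

    Uses greedy first-fit-decreasing bin packing over files smaller than
    the target; files already at or above the target are left alone.
    """
    small = sorted((s for s in file_sizes_mb if s < target_mb), reverse=True)
    bins = []  # each bin is a list of file sizes to rewrite into one file
    for size in small:
        for b in bins:
            if sum(b) + size <= target_mb:
                b.append(size)
                break
        else:
            # No existing bin has room, so start a new one.
            bins.append([size])
    # A bin holding a single file gains nothing from being rewritten.
    return [b for b in bins if len(b) > 1]


sizes = [4, 8, 130, 60, 60, 2, 100]  # hypothetical file sizes in MB
plan = plan_compaction(sizes)
print(plan)
```

With these sample sizes, seven files would become three after the rewrite: the untouched 130 MB file plus one output file per batch.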

Agentic Lakehouse with Glue Data Catalog

  • Glue Data Catalog exposes APIs that enable intelligent agents to discover tables, detect schema drift, and optimize pipelines

Iceberg-Based Architecture Patterns

Batch ETL

  • Iceberg provides schema evolution, partition evolution, and time travel capabilities to simplify batch ETL pipelines
  • Leveraging S3 Table Buckets for storage optimization and tuning partitioning strategies can further improve query performance

Change Data Capture (CDC)

  • Iceberg's efficient upserts and merge operations, as well as support for deletion vectors, make it well-suited for CDC workloads
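  • Choose between the two write strategies based on the workload: merge-on-read suits write-heavy tables because deletes are recorded separately and applied at read time, while copy-on-write suits read-heavy tables because affected files are rewritten at write time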
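The trade-off between the two strategies can be sketched with a toy model in Python. This is an illustration only, not Iceberg's implementation: `copy_on_write_delete` and `MergeOnReadTable` are hypothetical names, a data file is a plain list, and a set of row positions stands in for a positional delete file.

```python
def copy_on_write_delete(data_file, positions):
    """Copy-on-write: rewrite the whole file without the deleted rows."""
    drop = set(positions)
    return [row for i, row in enumerate(data_file) if i not in drop]


class MergeOnReadTable:
    """Merge-on-read: leave the data file untouched, record deleted positions."""

    def __init__(self, data_file):
        self.data_file = list(data_file)
        self.deleted_positions = set()  # stand-in for a positional delete file

    def delete(self, positions):
        # Cheap write: only the small set of positions is recorded.
        self.deleted_positions.update(positions)

    def scan(self):
        # Readers pay the cost: deletes are applied at query time.
        return [
            row
            for i, row in enumerate(self.data_file)
            if i not in self.deleted_positions
        ]


rows = ["a", "b", "c", "d"]

cow_result = copy_on_write_delete(rows, [1])  # expensive write, plain read

mor = MergeOnReadTable(rows)
mor.delete([1])                               # cheap write
mor_result = mor.scan()                       # filtering happens on read

print(cow_result, mor_result)
```

Both paths return the same rows; they differ in whether the cost of the delete is paid by the writer or by every subsequent reader until compaction merges the deletes back into the data files.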

High-Concurrency Streaming

  • Iceberg's multi-writer support and exactly-once semantics make it a good fit for high-throughput streaming workloads
  • S3 Table Buckets handle small file management, and namespace isolation can be used to avoid commit conflicts
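A minimal sketch of how optimistic multi-writer commits and conflict retries fit together is shown below. It is a toy model under stated assumptions: `ToyCatalog`, `CommitConflict`, and `append_with_retry` are hypothetical names, and an in-process lock stands in for the catalog's atomic swap of the table's metadata pointer.

```python
import threading


class CommitConflict(Exception):
    """Raised when the table changed since the writer read its base snapshot."""


class ToyCatalog:
    """Holds the current snapshot ID and committed files for one table."""

    def __init__(self):
        self._lock = threading.Lock()
        self.snapshot_id = 0
        self.files = []

    def commit(self, base_snapshot_id, new_files):
        # Atomic compare-and-swap: succeed only if nobody committed since base.
        with self._lock:
            if base_snapshot_id != self.snapshot_id:
                raise CommitConflict(base_snapshot_id)
            self.files = self.files + list(new_files)
            self.snapshot_id += 1
            return self.snapshot_id


def append_with_retry(catalog, new_files, max_attempts=50):
    for _ in range(max_attempts):
        base = catalog.snapshot_id  # re-read the latest table state
        try:
            return catalog.commit(base, new_files)
        except CommitConflict:
            continue  # another writer won; retry against the new snapshot
    raise RuntimeError("gave up after repeated commit conflicts")


catalog = ToyCatalog()
writers = [
    threading.Thread(target=append_with_retry, args=(catalog, [f"file-{i}.parquet"]))
    for i in range(8)
]
for w in writers:
    w.start()
for w in writers:
    w.join()

print(catalog.snapshot_id, sorted(catalog.files))
```

Every writer's files land exactly once, and each successful commit produces one new snapshot. In this model, writers targeting different tables would hold different `ToyCatalog` instances and never conflict, which is the intuition behind isolating high-volume streams from one another.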

Medidata's Iceberg Transformation Journey

Previous Architecture Challenges

  • Medidata's previous batch-based data integration pipelines faced issues with latency, inconsistency, and scalability as data volumes grew

Iceberg-Based Solution

  • Medidata replaced their batch pipelines with a streaming-based architecture using Iceberg, Flink, and Kafka
  • This unified their data estate, improved data availability and integrity, and simplified observability and security

Benefits Realized

  • Significant reduction in data latency and improved data consistency
  • Improved data scalability by leveraging S3 as the primary storage layer
  • Simplified observability and security by centralizing data access through the Glue Data Catalog

Optimizing Iceberg-Powered Analytics on AWS

Query Federation and Catalog Integration

  • Catalog federation enables querying Iceberg tables across multiple catalogs, providing a unified view of data
  • This supports agile analytics and makes cross-source reporting easier

Materialized Views for Performance

  • Materialized views in Glue provide pre-computed aggregates stored as Iceberg tables, boosting query performance
  • Best practices include identifying high-frequency queries, optimizing storage, and managing refresh schedules

Integrating Iceberg with AWS Compute Services

  • Athena leverages Iceberg's metadata for efficient query execution and supports table management operations
  • Redshift integrates seamlessly with Iceberg, providing enterprise-scale concurrency and performance optimizations
  • EMR Spark provides flexible, scalable, and cost-optimized Iceberg analytics through features like adaptive query execution

Key Takeaways

  1. Use S3 Table Buckets for managed Iceberg table maintenance and storage optimization
  2. Leverage materialized views in Glue to pre-compute aggregates and boost query performance
  3. Prioritize catalog management and federation to enable agile, cross-source analytics
  4. Integrate Iceberg with AWS compute services like Athena, Redshift, and EMR Spark to unlock enterprise-grade analytics
  5. Iceberg powers both batch and real-time data pipelines, enabling consistent, low-latency data access
