AWS re:Invent 2025 - Best practices for building Apache Iceberg based lakehouse architectures on AWS

Building Apache Iceberg-Based Lakehouse Architectures on AWS

The Data Lake Crisis and the Rise of Apache Iceberg

The Challenges of Traditional Data Lakes

Data lakes faced issues like data corruption, lack of point-in-time snapshots, slow queries, and difficulties with schema evolution
These problems led to the emergence of the Apache Iceberg table format as a solution

Key Benefits of Apache Iceberg

ACID Guarantees: Iceberg provides optimistic concurrency control, allowing multiple writers to update the same table without data corruption
Time Travel and Snapshots: Iceberg's table-level snapshots enable point-in-time rollbacks and time travel queries for debugging and checkpointing
Elegant Schema Evolution: Iceberg tracks columns by ID, allowing easy addition, update, or renaming of columns without rewriting data
Improved Query Performance: Iceberg's optimized metadata layout enables efficient queries without expensive full-table scans
Efficient Row-Level Updates: Iceberg supports equality deletes and position deletes, as well as deletion vectors for improved performance on CDC-heavy workloads
Variant Type Support and Row Lineage: Iceberg v3 adds native support for semi-structured data and row-level lineage tracking for improved auditability
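
To make the optimistic concurrency and snapshot points above concrete, here is a minimal in-memory sketch. It is a toy model, not the Iceberg API: ToyTable, CommitConflict, and append_with_retry are hypothetical names invented for illustration, and a real table commits by atomically swapping a metadata pointer in the catalog rather than appending to a Python list.

```python
from dataclasses import dataclass, field


class CommitConflict(Exception):
    """Raised when the table changed since the writer read it."""


@dataclass
class ToyTable:
    # Each snapshot is an immutable tuple of rows; the last one is current.
    snapshots: list = field(default_factory=lambda: [()])

    @property
    def current_id(self) -> int:
        return len(self.snapshots) - 1

    def read(self, snapshot_id=None) -> tuple:
        # Time travel: any earlier snapshot can still be read by its id.
        sid = self.current_id if snapshot_id is None else snapshot_id
        return self.snapshots[sid]

    def commit(self, base_id: int, new_rows: list) -> int:
        # Optimistic check: succeed only if nobody committed since base_id.
        if base_id != self.current_id:
            raise CommitConflict(f"base {base_id} != current {self.current_id}")
        self.snapshots.append(self.snapshots[base_id] + tuple(new_rows))
        return self.current_id


def append_with_retry(table: ToyTable, rows: list, max_attempts: int = 3) -> int:
    # On conflict, re-read the latest snapshot id and try again.
    for _ in range(max_attempts):
        try:
            return table.commit(table.current_id, rows)
        except CommitConflict:
            continue
    raise CommitConflict("gave up after retries")


table = ToyTable()
base = table.current_id            # writers A and B both start from snapshot 0
table.commit(base, ["a1"])         # writer A commits first
try:
    table.commit(base, ["b1"])     # writer B's base snapshot is now stale
    conflicted = False
except CommitConflict:
    conflicted = True
append_with_retry(table, ["b1"])   # B retries against the latest snapshot
```

After this runs, table.read() returns both writers' rows, while table.read(1) still returns the table as it looked after writer A's commit. That is the property that makes point-in-time rollback and debugging possible.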

The AWS Stack for Iceberg-Powered Lakehouses

Integrated Ecosystem for Iceberg

Data sources are ingested into Iceberg tables using AWS services like Glue, EMR, Kinesis, and Firehose
The Glue Data Catalog acts as the technical metadata store, with Lake Formation providing enterprise-scale governance
Compute engines like Athena, Redshift, and EMR Spark natively support Iceberg tables for analytics and processing

Glue Data Catalog and Lake Formation

Glue Data Catalog is an Iceberg-compatible metadata store, supporting multi-catalog federation
Lake Formation provides fine-grained access control, data mesh capabilities, and credential vending for secure data access

S3 Table Buckets for Storage Optimization

S3 Table Buckets automatically handle Iceberg table maintenance tasks like compaction, snapshot retention, and orphan file cleanup
This reduces operational overhead and ensures optimal query performance and storage efficiency
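
As an illustration of what a snapshot retention policy decides, the sketch below picks which snapshots to expire given a maximum age and a minimum number to keep. The function name, the parameter names, and the snapshot dictionaries are assumptions made for this example; they are not the S3 Tables configuration format.

```python
from datetime import datetime, timedelta


def snapshots_to_expire(snapshots, now, max_age, min_to_keep):
    """Return ids of snapshots older than max_age, always keeping the
    newest min_to_keep snapshots regardless of their age."""
    newest_first = sorted(snapshots, key=lambda s: s["committed_at"], reverse=True)
    protected = {s["id"] for s in newest_first[:min_to_keep]}
    return [
        s["id"]
        for s in newest_first
        if s["id"] not in protected and now - s["committed_at"] > max_age
    ]


now = datetime(2025, 12, 1, 12, 0)
history = [
    {"id": 1, "committed_at": now - timedelta(days=10)},
    {"id": 2, "committed_at": now - timedelta(days=6)},
    {"id": 3, "committed_at": now - timedelta(days=2)},
    {"id": 4, "committed_at": now - timedelta(hours=1)},
]
expired = snapshots_to_expire(history, now, max_age=timedelta(days=5), min_to_keep=2)
```

With a five-day limit and a floor of two snapshots, snapshots 2 and 1 are selected for expiry. Expiring a snapshot also shortens the time-travel window, so the retention settings are a trade-off between storage cost and how far back a rollback can reach.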

Agentized Lakehouse with Glue Data Catalog

Glue Data Catalog exposes APIs that enable intelligent agents to discover tables, detect schema drift, and optimize pipelines
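
One way an agent might detect schema drift is to compare the column list it expects against the one the catalog currently reports. The sketch below assumes both lists are shaped like the Columns entries of a Glue table's StorageDescriptor (dictionaries with "Name" and "Type"); detect_schema_drift and the sample columns are hypothetical. Fetching the live list, for example with boto3's glue get_table call, is deliberately left out so the example runs without AWS credentials.

```python
def detect_schema_drift(expected_columns, actual_columns):
    """Compare two column lists (dicts with "Name" and "Type") and report
    which columns were added, removed, or changed type."""
    expected = {c["Name"]: c["Type"] for c in expected_columns}
    actual = {c["Name"]: c["Type"] for c in actual_columns}
    return {
        "added": sorted(actual.keys() - expected.keys()),
        "removed": sorted(expected.keys() - actual.keys()),
        "type_changed": sorted(
            name
            for name in expected.keys() & actual.keys()
            if expected[name] != actual[name]
        ),
    }


# Hypothetical sample data: what the pipeline was built against...
expected_columns = [
    {"Name": "order_id", "Type": "bigint"},
    {"Name": "amount", "Type": "double"},
    {"Name": "legacy_flag", "Type": "boolean"},
]
# ...and what the catalog reports today.
actual_columns = [
    {"Name": "order_id", "Type": "bigint"},
    {"Name": "amount", "Type": "decimal(18,2)"},
    {"Name": "currency", "Type": "string"},
]
drift = detect_schema_drift(expected_columns, actual_columns)
```

An agent could use a non-empty result to pause a pipeline, open a ticket, or propose a mapping change, depending on which of the three lists is populated.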

Iceberg-Based Architecture Patterns

Batch ETL

Iceberg provides schema evolution, partition evolution, and time travel capabilities to simplify batch ETL pipelines
Leveraging S3 Table Buckets for storage optimization and tuning partitioning strategies can further improve query performance
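
The schema evolution mentioned above works without rewriting data because columns are resolved by ID rather than by name. The sketch below is a toy model of that idea, with ToySchema and the sample row invented for illustration; it is not how Iceberg stores schemas or data files.

```python
class ToySchema:
    def __init__(self):
        self.name_to_id = {}
        self._next_id = 1

    def add_column(self, name):
        # New columns always get a fresh id, never a reused one.
        self.name_to_id[name] = self._next_id
        self._next_id += 1

    def rename_column(self, old, new):
        # Only the name changes; the id (and every data file) stays as-is.
        self.name_to_id[new] = self.name_to_id.pop(old)

    def project(self, stored_row):
        # Resolve each current column name through its id; ids that an
        # older file does not contain read as None.
        return {name: stored_row.get(fid) for name, fid in self.name_to_id.items()}


schema = ToySchema()
schema.add_column("id")
schema.add_column("amt")
old_file_row = {1: 42, 2: 9.5}   # written before any evolution, keyed by column id

schema.rename_column("amt", "amount")
schema.add_column("currency")
row = schema.project(old_file_row)
```

The old row is readable under the new schema: the renamed column still finds its values through id 2, and the newly added column reads as None because the old file never contained id 3.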

Change Data Capture (CDC)

Iceberg's efficient upserts and merge operations, as well as support for deletion vectors, make it well-suited for CDC workloads
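The choice between merge-on-read and copy-on-write depends on the workload: merge-on-read favors write-heavy workloads because it defers the cost to readers, while copy-on-write favors read-heavy workloads because data files are rewritten at write time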
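
The difference between the two strategies can be sketched with a single data file and a delete. This is a simplified in-memory model; the function names and sample rows are hypothetical, and real position deletes and deletion vectors are stored as separate files rather than Python sets.

```python
def copy_on_write_delete(data_file, predicate):
    """Rewrite the whole file without the deleted rows (costly write, cheap read)."""
    return [row for row in data_file if not predicate(row)]


def merge_on_read_delete(data_file, predicate):
    """Leave the file untouched and record deleted row positions (cheap write)."""
    return {pos for pos, row in enumerate(data_file) if predicate(row)}


def read_with_deletes(data_file, deleted_positions):
    """Readers pay the merge cost by skipping deleted positions at scan time."""
    return [row for pos, row in enumerate(data_file) if pos not in deleted_positions]


data_file = [
    {"id": 1, "status": "active"},
    {"id": 2, "status": "cancelled"},
    {"id": 3, "status": "active"},
]


def is_cancelled(row):
    return row["status"] == "cancelled"


cow_file = copy_on_write_delete(data_file, is_cancelled)    # a new, smaller file
positions = merge_on_read_delete(data_file, is_cancelled)   # original file kept
mor_view = read_with_deletes(data_file, positions)
```

Both paths give readers the same rows; they differ in who pays. In Iceberg the mode is chosen per operation through the table properties write.delete.mode, write.update.mode, and write.merge.mode, each set to copy-on-write or merge-on-read.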

High-Concurrency Streaming

Iceberg's multi-writer support and exactly-once semantics make it a good fit for high-throughput streaming workloads
S3 Table Buckets handle small file management, and namespace isolation can be used to avoid commit conflicts
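
Small-file management is essentially a bin-packing problem: many small files written by streaming commits are grouped and rewritten into fewer, larger ones. The sketch below plans such groups with a first-fit-decreasing heuristic. The function name and the size thresholds are assumptions chosen for the example, not the values or the algorithm that S3 Table Buckets use.

```python
def plan_compaction(file_sizes_mb, target_mb=512, small_file_mb=128):
    """Group files below small_file_mb into bins of at most target_mb each
    (first-fit decreasing). Files that are already large are left alone."""
    small = sorted((s for s in file_sizes_mb if s < small_file_mb), reverse=True)
    bins = []
    for size in small:
        for group in bins:
            if sum(group) + size <= target_mb:
                group.append(size)
                break
        else:
            # No existing bin has room, so start a new one.
            bins.append([size])
    # A bin holding a single file would be rewritten for no benefit.
    return [group for group in bins if len(group) > 1]


plan = plan_compaction([600, 100, 90, 80, 5, 5, 5], target_mb=200)
```

Here the 600 MB file is skipped, and the six small files are planned into two rewrite groups, so six files would become two after compaction.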

Medidata's Iceberg Transformation Journey

Previous Architecture Challenges

Medidata's previous batch-based data integration pipelines faced issues with latency, inconsistency, and scalability as data volumes grew

Iceberg-Based Solution

Medidata replaced their batch pipelines with a streaming-based architecture using Iceberg, Flink, and Kafka
This unified their data estate, improved data availability and integrity, and simplified observability and security

Benefits Realized

Significant reduction in data latency and improved data consistency
Improved data scalability by leveraging S3 as the primary storage layer
Simplified observability and security by centralizing data access through the Glue Data Catalog

Optimizing Iceberg-Powered Analytics on AWS

Query Federation and Catalog Integration

Catalog federation enables querying Iceberg tables across multiple catalogs, providing a unified view of data
This supports agile analytics and cross-source reporting without first copying data into a single catalog

Materialized Views for Performance

Materialized views in Glue provide pre-computed aggregates stored as Iceberg tables, boosting query performance
Best practices include identifying high-frequency queries, optimizing storage, and managing refresh schedules

Integrating Iceberg with AWS Compute Services

Athena leverages Iceberg's metadata for efficient query execution and supports table management operations
Redshift integrates seamlessly with Iceberg, providing enterprise-scale concurrency and performance optimizations
EMR Spark provides flexible, scalable, and cost-optimized Iceberg analytics through features like adaptive query execution
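
The way an engine can use table metadata to avoid full-table scans can be illustrated with per-file column statistics. The sketch below is a simplified model: the manifest layout, file names, and prune_files helper are hypothetical, and real manifests carry richer statistics in a binary format.

```python
def prune_files(manifest, column, lower, upper):
    """Keep only files whose [min, max] range for `column` can overlap the
    query range [lower, upper]; the rest are skipped without being opened."""
    return [
        entry["path"]
        for entry in manifest
        if entry["stats"][column]["max"] >= lower
        and entry["stats"][column]["min"] <= upper
    ]


# ISO-8601 date strings compare correctly as plain strings.
manifest = [
    {"path": "f1.parquet", "stats": {"order_date": {"min": "2025-01-01", "max": "2025-01-31"}}},
    {"path": "f2.parquet", "stats": {"order_date": {"min": "2025-02-01", "max": "2025-02-28"}}},
    {"path": "f3.parquet", "stats": {"order_date": {"min": "2025-03-01", "max": "2025-03-31"}}},
]
to_scan = prune_files(manifest, "order_date", "2025-02-10", "2025-03-05")
```

A query for 10 February through 5 March only needs the February and March files; the January file is eliminated from the plan using metadata alone.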

Key Takeaways

Use S3 Table Buckets for managed Iceberg table maintenance and storage optimization
Leverage materialized views in Glue to pre-compute aggregates and boost query performance
Prioritize catalog management and federation to enable agile, cross-source analytics
Integrate Iceberg with AWS compute services like Athena, Redshift, and EMR Spark to unlock enterprise-grade analytics
Iceberg powers both batch and real-time data pipelines, enabling consistent, low-latency data access
