AWS re:Invent 2025 - Enterprise-scale ETL optimization for Apache Spark (ANT336)

Enterprise-scale ETL Optimization for Apache Spark

Challenges in Modern ETL Pipelines

Businesses move fast, but data systems can feel stuck in the past

Enterprise customers spend significant time tuning Spark configurations, yet many jobs still end up overprovisioned or fail outright

Key challenges include:

Security: Manually enforcing row and column-level rules, with scattered policies and inconsistent governance
Performance: Spark jobs slowed by metadata bottlenecks, redundant scans, and encryption overhead
Consistency: Inconsistent Spark behavior across EMR, Glue, and Athena, leading to a fragmented ecosystem

Reimagining the ETL Experience on AWS

The goal: a secure, unified, high-performance Spark platform built into the AWS ecosystem

Three key pillars:

Security: Fine-grained access control, native integration with AWS Lake Formation, and multi-dialect views with invoker-aware permissions
Unification: Consistent Spark runtime across EMR, Glue, and Athena, with default use of S3A connectors and support for all S3 storage classes
Performance: Faster Iceberg reads/writes, optimized JSON processing, and materialized views for dramatic query time reduction

Unified Security with AWS Lake Formation

Evolution of Lake Formation integration with Spark:

2022: Centralized "record server" component enforcing table-level permissions
2023: Fine-grained access control (FGAC) for row, column, and cell-level filtering
2024: Spark runtime directly applying Lake Formation permissions, no more proxies
2025: Fully native integration, with Spark directly reading, writing, and managing tables under Lake Formation governance

Benefits of FGAC vs. full table access control (FTAC):

FGAC provides row and column-level data segregation, important for interactive analytics and compliance
FTAC offers complete visibility and flexibility for trusted ETL pipelines, ML preparation, and batch workloads
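The difference between the two access modes can be pictured with a toy model. The sketch below is plain Python, not the Lake Formation API: the `Grant` class, the `apply_grant` helper, and the sample `orders` data are all hypothetical names invented for illustration. It only shows what "row and column-level segregation" means compared with full table access.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

Row = dict[str, object]


@dataclass(frozen=True)
class Grant:
    """Hypothetical grant: the columns a principal may see and a row predicate."""
    columns: frozenset[str]
    row_filter: Callable[[Row], bool]


def apply_grant(rows: Iterable[Row], grant: Grant) -> list[Row]:
    """Keep only rows that pass the filter, then drop columns that are not granted."""
    return [
        {col: value for col, value in row.items() if col in grant.columns}
        for row in rows
        if grant.row_filter(row)
    ]


# Made-up sample table.
orders = [
    {"order_id": 1, "region": "EU", "amount": 120.0, "email": "a@example.com"},
    {"order_id": 2, "region": "US", "amount": 80.0, "email": "b@example.com"},
    {"order_id": 3, "region": "EU", "amount": 45.5, "email": "c@example.com"},
]

# FGAC-style grant: EU rows only, and the email column is hidden.
eu_analyst = Grant(
    columns=frozenset({"order_id", "region", "amount"}),
    row_filter=lambda row: row["region"] == "EU",
)

# FTAC-style grant: every row and every column.
trusted_etl = Grant(columns=frozenset(orders[0]), row_filter=lambda row: True)

fgac_view = apply_grant(orders, eu_analyst)
ftac_view = apply_grant(orders, trusted_etl)
```

In this model the analyst sees two EU rows without `email`, while the trusted pipeline sees the table unchanged. In the real service the filtering is enforced by the engine and Lake Formation rather than by application code.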

Unified SQL Views Across Analytics Engines

Challenge: Teams often rewrite the same business logic separately for Athena, Glue, and EMR Spark, leading to duplicated logic, drift, and governance issues

AWS Glue Data Catalog views provide:

Unified governance: Single logical view definition, with Lake Formation enforcing permissions consistently
Cross-engine consistency: The same results across Athena, Glue, and EMR Spark, because every engine resolves the same view definition
Faster collaboration: Teams can reuse the same view logic without rewrites or coordination

Unified Spark Runtime Across AWS

Historical context: EMR originally used a custom EMRFS connector, while the open-source community developed S3A

Recent unification efforts:

EMR 7.12, Glue 5.1, and Athena now use the same Spark 3.5.6 runtime, Iceberg 1.10.0, Hive 1.0.2, and other libraries
S3A is now the default storage connector, with alignment to the open-source community
S3A provides access to all S3 storage classes, including Glacier, enabling new ETL use cases
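For readers running open-source Spark outside these managed runtimes, selecting S3A is usually a matter of Hadoop filesystem settings. The sketch below builds the corresponding `spark-submit` arguments. The helper names `s3a_spark_conf` and `to_submit_args` are hypothetical; the property keys follow Hadoop's `fs.<scheme>.impl` convention and assume the `hadoop-aws` module is on the classpath. Per the talk, S3A is already the default on the unified runtime, so this may be unnecessary there.

```python
S3A_FILESYSTEM = "org.apache.hadoop.fs.s3a.S3AFileSystem"


def s3a_spark_conf(map_s3_scheme: bool = True) -> dict[str, str]:
    """Spark settings that select S3A; optionally route s3:// URIs to it as well."""
    conf = {"spark.hadoop.fs.s3a.impl": S3A_FILESYSTEM}
    if map_s3_scheme:
        # Assumption: existing jobs use s3:// paths and should keep working unchanged.
        conf["spark.hadoop.fs.s3.impl"] = S3A_FILESYSTEM
    return conf


def to_submit_args(conf: dict[str, str]) -> list[str]:
    """Flatten a config dict into repeated `--conf key=value` arguments."""
    args: list[str] = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


submit_args = to_submit_args(s3a_spark_conf())
```

The resulting list can be appended to a `spark-submit` command line. Credentials, endpoints, and storage-class handling are deliberately left out of this sketch.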

Performance Optimizations

Materialized Views:

Caching layer that can be used to optimize ETL pipelines by pre-filtering data, caching intermediate results, and maintaining reporting views
Integrated with EMR, Glue, and Athena Spark, with automatic refresh managed by AWS
Can improve query performance by up to 8x when jobs are rewritten to read from the materialized view
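The idea behind reading from a materialized view can be sketched with SQLite from the Python standard library. This is an analogy only, not the AWS feature: the table `events`, the stand-in table `mv_revenue_by_region`, and the helper functions are hypothetical, and the refresh here is a manual call, whereas the talk describes refresh as managed by AWS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE events (region TEXT, amount REAL);
    INSERT INTO events VALUES
        ('EU', 10.0), ('EU', 5.0), ('US', 7.5), ('US', 2.5), ('APAC', 4.0);

    -- Stand-in for a materialized view: the aggregate is computed once and stored.
    CREATE TABLE mv_revenue_by_region AS
        SELECT region, SUM(amount) AS revenue FROM events GROUP BY region;
    """
)


def revenue_from_base(db: sqlite3.Connection) -> list[tuple]:
    """Original job: scans and aggregates the base table on every run."""
    return db.execute(
        "SELECT region, SUM(amount) FROM events GROUP BY region ORDER BY region"
    ).fetchall()


def revenue_from_mv(db: sqlite3.Connection) -> list[tuple]:
    """Rewritten job: reads the precomputed rows instead of re-aggregating."""
    return db.execute(
        "SELECT region, revenue FROM mv_revenue_by_region ORDER BY region"
    ).fetchall()


def refresh_mv(db: sqlite3.Connection) -> None:
    """Manual full refresh; a managed service would handle this step itself."""
    db.executescript(
        """
        DELETE FROM mv_revenue_by_region;
        INSERT INTO mv_revenue_by_region
            SELECT region, SUM(amount) FROM events GROUP BY region;
        """
    )
```

Both functions return the same rows as long as the stored aggregate is fresh. After new events arrive, the stored result is stale until `refresh_mv` runs, which is why the refresh policy matters as much as the rewrite.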

Other performance enhancements:

4.5x faster Iceberg reads, 2x faster Iceberg writes compared to open-source
20% faster JSON processing
85% reduction in encryption overhead, resulting in 20% faster jobs
10-96x improvements for common string manipulation functions like uppercase, lowercase, trim, length, and reverse
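As a back-of-envelope check on the encryption figures, one can ask what share of job runtime encryption must have taken for an 85% reduction in that overhead to make jobs 20% faster. The sketch below works this out under two readings of "20% faster". Both readings and the function names are assumptions, and the model treats encryption as a simple additive fraction of runtime; the talk summary does not say how the numbers were measured.

```python
def share_if_runtime_drops(overhead_reduction: float, runtime_drop: float) -> float:
    """Reading 1: total runtime falls by `runtime_drop` (0.20 means 20% less time).

    If encryption is a fraction f of runtime, new runtime = 1 - overhead_reduction * f,
    so overhead_reduction * f = runtime_drop.
    """
    return runtime_drop / overhead_reduction


def share_if_speedup(overhead_reduction: float, speedup: float) -> float:
    """Reading 2: throughput rises by a factor `speedup` (1.20 means 1.2x).

    1 / (1 - overhead_reduction * f) = speedup, solved for f.
    """
    return (1.0 - 1.0 / speedup) / overhead_reduction


share_reading_1 = share_if_runtime_drops(0.85, 0.20)  # roughly 0.235
share_reading_2 = share_if_speedup(0.85, 1.20)        # roughly 0.196
```

Under either reading, the two figures are mutually consistent only if encryption accounted for roughly a fifth to a quarter of runtime in the measured workloads.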

Key Takeaways

AWS has unified the Spark experience across EMR, Glue, and Athena, providing a consistent runtime, connectors, and capabilities

Security is now built-in natively, with fine-grained access control and centralized governance through AWS Lake Formation

Performance has been significantly optimized, with materialized views, Iceberg improvements, and optimizations for common ETL patterns

These enhancements enable enterprises to build scalable, secure, and high-performance ETL pipelines on AWS without the previous complexities and operational overhead
