AWS re:Invent 2025 - Enterprise-scale ETL optimization for Apache Spark (ANT336)

Enterprise-scale ETL Optimization for Apache Spark

Challenges in Modern ETL Pipelines

  • Businesses move fast, but data systems can feel stuck in the past
  • Enterprise customers spend significant time tweaking Spark configurations, with many jobs either overprovisioned or failing altogether
  • Key challenges include:
    • Security: Row- and column-level rules are enforced manually, with scattered policies and inconsistent governance
    • Performance: Spark job slowdowns from metadata bottlenecks, redundant scans, and encryption complexities
    • Consistency: Inconsistent Spark behavior across EMR, Glue, and Athena, leading to a fragmented ecosystem

Reimagining the ETL Experience on AWS

  • Secure, unified, and high-performance Spark platform built into the AWS ecosystem
  • Three key pillars:
    1. Security: Fine-grained access control, native integration with AWS Lake Formation, and multi-dialect views with invoker-aware permissions
    2. Unification: Consistent Spark runtime across EMR, Glue, and Athena, with default use of S3A connectors and support for all S3 storage classes
    3. Performance: Faster Iceberg reads/writes, optimized JSON processing, and materialized views for dramatic query time reduction

Unified Security with AWS Lake Formation

  • Evolution of Lake Formation integration with Spark:
    • 2022: Centralized "record server" component enforcing table-level permissions
    • 2023: Fine-grained access control (FGAC) for row, column, and cell-level filtering
    • 2024: Spark runtime directly applying Lake Formation permissions, no more proxies
    • 2025: Fully native integration, with Spark directly reading, writing, and managing tables under Lake Formation governance
  • Benefits of FGAC vs. full table access control (FTAC):
    • FGAC provides row and column-level data segregation, important for interactive analytics and compliance
    • FTAC offers complete visibility and flexibility for trusted ETL pipelines, ML preparation, and batch workloads
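
The difference between the two access modes can be illustrated with a small Python sketch. It does not call Lake Formation or Spark; the Policy class, the apply_fgac and apply_ftac helpers, and the sample orders are all hypothetical names made up for this illustration. It only shows the idea: FGAC drops rows and columns before the caller sees them, while FTAC hands a trusted job the whole table.

```python
from dataclasses import dataclass
from typing import Callable

Row = dict[str, object]


@dataclass(frozen=True)
class Policy:
    """Hypothetical grant: visible columns plus a row-level predicate."""
    allowed_columns: frozenset[str]
    row_filter: Callable[[Row], bool]


def apply_fgac(rows: list[Row], policy: Policy) -> list[Row]:
    """Fine-grained access: keep only permitted rows, then permitted columns."""
    return [
        {col: val for col, val in row.items() if col in policy.allowed_columns}
        for row in rows
        if policy.row_filter(row)
    ]


def apply_ftac(rows: list[Row]) -> list[Row]:
    """Full table access: a trusted pipeline sees every row and column."""
    return [dict(row) for row in rows]


# Illustrative data only (not from the talk).
ORDERS: list[Row] = [
    {"order_id": 1, "region": "EU", "customer_email": "a@example.com", "amount": 120.0},
    {"order_id": 2, "region": "US", "customer_email": "b@example.com", "amount": 75.5},
    {"order_id": 3, "region": "EU", "customer_email": "c@example.com", "amount": 310.0},
]

# Assumed policy: an EU analyst sees EU rows only and never the email column.
EU_ANALYST = Policy(
    allowed_columns=frozenset({"order_id", "region", "amount"}),
    row_filter=lambda row: row["region"] == "EU",
)

analyst_view = apply_fgac(ORDERS, EU_ANALYST)  # interactive analytics
etl_view = apply_ftac(ORDERS)                  # trusted batch ETL job
```

In this sketch the analyst gets two EU rows without customer_email, while the ETL job gets all three rows unchanged. In the setup described in the talk, this filtering is applied by the Spark runtime under Lake Formation governance rather than in user code.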

Unified SQL Views Across Analytics Engines

  • Challenge: Teams often rewrite the same business logic separately for Athena, Glue, and EMR Spark, leading to duplicated logic, drift, and governance issues
  • AWS Glue Data Catalog views provide:
    • Unified governance: Single logical view definition, with Lake Formation enforcing permissions consistently
    • Cross-engine consistency: Same result across Athena, Glue, Spark, etc. due to shared underlying logic
    • Faster collaboration: Teams can reuse the same view logic without rewrites or coordination

Unified Spark Runtime Across AWS

  • Historical context: EMR originally used a custom EMRFS connector, while the open-source community developed S3A
  • Recent unification efforts:
    • EMR 7.12, Glue 5.1, and Athena now use the same Spark 3.5.6 runtime, Iceberg 1.10.0, Hudi 1.0.2, and other libraries
    • S3A is now the default storage connector, with alignment to the open-source community
    • S3A provides access to all S3 storage classes, including Glacier, enabling new ETL use cases

Performance Optimizations

  • Materialized Views:
    • Caching layer that can be used to optimize ETL pipelines by pre-filtering data, caching intermediate results, and maintaining reporting views
    • Integrated with EMR, Glue, and Athena Spark, with automatic refresh managed by AWS
    • Can deliver up to 8x faster queries when jobs are rewritten to read from the materialized view instead of the base tables
  • Other performance enhancements:
    • 4.5x faster Iceberg reads and 2x faster Iceberg writes compared to open-source Spark
    • 20% faster JSON processing
    • 85% reduction in encryption overhead, resulting in 20% faster jobs
    • 10-96x improvements for common string manipulation functions like uppercase, lowercase, trim, length, and reverse
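
The materialized view idea described above can be sketched with Python's built-in sqlite3 module. This is only an analogy under stated assumptions: SQLite has no materialized views, so an ordinary table (mv_sales_by_region, a hypothetical name) stands in for one, and the refresh is an explicit function call here, whereas the talk describes refresh as managed automatically by AWS. The sales table and its values are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (region TEXT, amount REAL);
INSERT INTO sales VALUES ('EU', 100.0), ('EU', 50.0), ('US', 70.0);
""")


def refresh_materialized_view(conn: sqlite3.Connection) -> None:
    """Recompute the stored aggregate from the base table (full refresh)."""
    conn.executescript("""
    DROP TABLE IF EXISTS mv_sales_by_region;
    CREATE TABLE mv_sales_by_region AS
        SELECT region, SUM(amount) AS total FROM sales GROUP BY region;
    """)


def totals_from_base(conn: sqlite3.Connection) -> list[tuple]:
    """Original job: scans and aggregates the base table on every run."""
    return conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
    ).fetchall()


def totals_from_mv(conn: sqlite3.Connection) -> list[tuple]:
    """Rewritten job: reads the precomputed result instead of re-aggregating."""
    return conn.execute(
        "SELECT region, total FROM mv_sales_by_region ORDER BY region"
    ).fetchall()


refresh_materialized_view(conn)
fresh = totals_from_mv(conn)

# New data lands in the base table; the stored result is stale until refreshed.
conn.execute("INSERT INTO sales VALUES ('US', 30.0)")
stale = totals_from_mv(conn)

refresh_materialized_view(conn)
refreshed = totals_from_mv(conn)
```

The rewritten query avoids the scan and aggregation of the base table, which is where the speedup comes from. The stale read in the middle shows why refresh matters: a materialized view trades freshness for speed unless something keeps it up to date.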

Key Takeaways

  • AWS has unified the Spark experience across EMR, Glue, and Athena, providing a consistent runtime, connectors, and capabilities
  • Security is now built-in natively, with fine-grained access control and centralized governance through AWS Lake Formation
  • Performance has been significantly optimized, with materialized views, Iceberg improvements, and optimizations for common ETL patterns
  • These enhancements enable enterprises to build scalable, secure, and high-performance ETL pipelines on AWS without the previous complexities and operational overhead
