AWS re:Invent 2025 - Innovations in AWS analytics: Data processing (ANT305)

Summary of AWS re:Invent 2025 - Innovations in AWS Analytics: Data Processing (ANT305)

Introduction

  • Presenters: Skinuk Bahare (Head of Analytics Portfolio), Neil Mukharji, and Anjali Norbert (Netflix Engineering Manager)
  • Focus: innovations in data processing services such as AWS Glue, Amazon EMR, Amazon Athena, and Amazon Redshift
  • Highlights the scale and growth of these services, with billions of queries and jobs executed per week against exabytes of data in Amazon S3

Key Innovations in Data Processing

AI-Powered Spark Upgrade

  • Challenge: Upgrading Spark runtime versions is difficult due to code and data consistency issues
  • Solution: AI-powered upgrade agent that automatically generates an upgrade plan, executes it, and validates data quality
  • Benefits:
    • Reduces Spark upgrade time from 6-12 months to minutes
    • Automatically handles code changes and data consistency checks
    • Provides observability and control over the upgrade process
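
The data-quality validation step can be pictured as comparing a job's output before and after the runtime upgrade. The following is a minimal sketch of such a check, not the upgrade agent's actual implementation; the function names and the fingerprinting approach are assumptions made for illustration.

```python
import hashlib
from collections import Counter


def row_fingerprint(row: dict) -> str:
    """Hash a row so that column order does not affect the result."""
    canonical = "|".join(f"{k}={row[k]!r}" for k in sorted(row))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def compare_outputs(before: list, after: list) -> dict:
    """Compare two job outputs as multisets of rows (row order is ignored)."""
    old = Counter(row_fingerprint(r) for r in before)
    new = Counter(row_fingerprint(r) for r in after)
    missing = sum((old - new).values())      # rows only in the old output
    unexpected = sum((new - old).values())   # rows only in the new output
    return {
        "row_count_before": len(before),
        "row_count_after": len(after),
        "missing_rows": missing,
        "unexpected_rows": unexpected,
        "consistent": missing == 0 and unexpected == 0,
    }


# Hypothetical outputs of the same job on the old and new runtime.
baseline = [{"id": 1, "total": 10.0}, {"id": 2, "total": 7.5}]
upgraded = [{"total": 7.5, "id": 2}, {"id": 1, "total": 10.0}]
report = compare_outputs(baseline, upgraded)
```

Treating the outputs as multisets keeps the check insensitive to row and column ordering, which can legitimately change between Spark versions, while still catching dropped or altered rows.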

Iceberg v3 Support

  • Apache Iceberg is an open table format for building data lakes on Amazon S3
  • EMR Spark runtime 7.12 supports Iceberg v3, which includes features like deletion vectors and row lineage
  • Enables more efficient data lake management by reducing write amplification and "smart delete" problems
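
The idea behind deletion vectors can be sketched in a few lines: instead of rewriting a data file to remove rows, the writer records the positions of deleted rows next to the file, and readers skip those positions. This is a conceptual illustration only. The `DataFile` class is hypothetical, and a plain Python set stands in for the compact bitmap a real table format would use.

```python
from dataclasses import dataclass, field


@dataclass
class DataFile:
    """An immutable data file plus a per-file set of deleted row positions."""
    rows: tuple
    deleted_positions: set = field(default_factory=set)

    def delete_where(self, predicate) -> int:
        """Mark matching rows as deleted without rewriting the file."""
        hits = {
            i for i, row in enumerate(self.rows)
            if i not in self.deleted_positions and predicate(row)
        }
        self.deleted_positions |= hits
        return len(hits)

    def scan(self):
        """Yield live rows, skipping positions recorded as deleted."""
        for i, row in enumerate(self.rows):
            if i not in self.deleted_positions:
                yield row


f = DataFile(rows=({"id": 1}, {"id": 2}, {"id": 3}))
removed = f.delete_where(lambda r: r["id"] == 2)
live = list(f.scan())
```

The delete touches only the small side structure, so its write cost scales with the number of deleted rows rather than the size of the file, which is where the reduction in write amplification comes from.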

Iceberg Materialized Views

  • Iceberg materialized views are pre-computed Iceberg tables that can be used to speed up queries
  • Automatically refreshed based on a defined schedule or when new data is available
  • Integrated into the AWS Glue Data Catalog for easy access from Athena, EMR Spark, and Glue Spark
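
To make the refresh behavior concrete, here is a small sketch of a pre-computed aggregate that is refreshed when new data arrives. It is not the Iceberg or Glue Data Catalog API; the `MaterializedSum` class is hypothetical, and the incremental refresh strategy is an assumption (the talk summary does not say how refreshes are computed).

```python
from collections import defaultdict


class MaterializedSum:
    """Pre-computed SUM(amount) GROUP BY key over an append-only base table."""

    def __init__(self, base_rows: list):
        self.base_rows = base_rows
        self.totals = defaultdict(float)
        self.refreshed_upto = 0  # number of base rows already folded in

    def is_stale(self) -> bool:
        """True when the base table has rows the view has not seen yet."""
        return self.refreshed_upto < len(self.base_rows)

    def refresh(self) -> int:
        """Fold in only the rows appended since the last refresh."""
        new_rows = self.base_rows[self.refreshed_upto:]
        for key, amount in new_rows:
            self.totals[key] += amount
        self.refreshed_upto = len(self.base_rows)
        return len(new_rows)

    def query(self, key) -> float:
        """Answer from the pre-computed result without scanning the base table."""
        return self.totals.get(key, 0.0)


orders = [("us", 10.0), ("eu", 5.0)]
view = MaterializedSum(orders)
view.refresh()
orders.append(("us", 2.5))       # new data lands in the base table
stale_before = view.is_stale()
processed = view.refresh()
```

A scheduler could call `refresh()` on a timer, or a trigger could call it whenever `is_stale()` becomes true, which corresponds to the two refresh modes listed above.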

Serverless Storage for EMR Spark

  • Eliminates the need for local disk provisioning for Spark workloads
  • Offloads shuffle data to a high-performance storage layer, improving Spark scaling and efficiency
  • Can result in up to 20% cost savings compared to traditional Spark deployments
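
The shuffle offload described above can be illustrated with a toy word count in which map tasks write partitioned records to a shared store and reduce tasks read them back. This is a conceptual sketch, not EMR's implementation: `ShuffleStore` is a hypothetical in-memory stand-in for the external storage layer.

```python
from collections import defaultdict
from zlib import crc32


class ShuffleStore:
    """Stand-in for an external shuffle storage layer (here: an in-memory dict)."""

    def __init__(self):
        self._blocks = defaultdict(list)  # partition id -> list of (key, value)

    def write(self, partition: int, records: list) -> None:
        self._blocks[partition].extend(records)

    def read(self, partition: int) -> list:
        return self._blocks[partition]


def partition_for(key: str, num_partitions: int) -> int:
    # crc32 is deterministic across processes, unlike Python's salted hash().
    return crc32(key.encode("utf-8")) % num_partitions


def map_task(words: list, store: ShuffleStore, num_partitions: int) -> None:
    """Emit (word, 1) pairs, grouped by target partition, to the store."""
    by_partition = defaultdict(list)
    for w in words:
        by_partition[partition_for(w, num_partitions)].append((w, 1))
    for p, records in by_partition.items():
        store.write(p, records)


def reduce_task(partition: int, store: ShuffleStore) -> dict:
    """Sum the counts for every key that landed in this partition."""
    counts = defaultdict(int)
    for key, value in store.read(partition):
        counts[key] += value
    return dict(counts)


store = ShuffleStore()
map_task(["a", "b", "a"], store, 4)
map_task(["b", "c"], store, 4)
result = {}
for p in range(4):
    result.update(reduce_task(p, store))
```

Because map and reduce tasks only talk to the store, neither side needs local disk sized for the shuffle, and workers that ran the map tasks can be released before the reduce tasks start.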

Ease of Use Innovations

SageMaker Notebooks with AI-Powered Assistance

  • SageMaker Notebooks provide a unified authoring, execution, and debugging experience for Python and Spark workloads
  • Leverages Athena for Apache Spark to deliver high-performance Spark capabilities in a serverless environment
  • Includes an AI agent that can generate SQL, Python, and Spark code, as well as entire notebook plans, based on user prompts and data catalog understanding

Serverless Airflow

  • Fully managed, serverless deployment of Apache Airflow for data orchestration
  • Provides workflow-level security and isolation, eliminating the need for separate Airflow environments
  • Integrated into SageMaker Unified Studio for easy authoring, monitoring, and management of workflows
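
Airflow models a workflow as a directed acyclic graph of tasks. The sketch below shows the underlying scheduling idea with Python's standard `graphlib` module rather than the Airflow API; the task names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical task names; each task maps to the set of tasks it depends on.
workflow = {
    "extract": set(),
    "transform": {"extract"},
    "quality_check": {"transform"},
    "load": {"transform"},
    "report": {"quality_check", "load"},
}


def run_in_waves(dag: dict) -> list:
    """Group tasks into waves; tasks within one wave could run in parallel."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted for a stable order
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = run_in_waves(workflow)
```

An orchestrator does the same bookkeeping at larger scale: it tracks which tasks have all dependencies satisfied, dispatches them, and unlocks downstream tasks as each one finishes.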

Security and Governance Innovations

Coarse-Grained Access Control

  • S3 Access Grants control read/write permissions on S3 buckets, prefixes, and objects
  • Catalog-level access control using AWS Lake Formation to grant users access to specific tables

Fine-Grained Access Control

  • Column, row, and cell-level security using AWS Lake Formation
  • Separation of system and user drivers to ensure security controls cannot be bypassed
  • Supports read and write access control for Iceberg, Delta Lake, and Hudi tables
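
Column, row, and cell-level filtering can be sketched as a row predicate combined with a column allow-list. This is an illustration of the concept, not the Lake Formation API; the `Policy` class and the sample table are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Policy:
    """Hypothetical policy: which columns a user may see, and which rows."""
    allowed_columns: frozenset
    row_filter: Callable[[dict], bool]


def apply_policy(rows: list, policy: Policy) -> list:
    """Drop rows that fail the row filter, then project to allowed columns."""
    return [
        {col: val for col, val in row.items() if col in policy.allowed_columns}
        for row in rows
        if policy.row_filter(row)
    ]


table = [
    {"customer": "a", "region": "us", "ssn": "111"},
    {"customer": "b", "region": "eu", "ssn": "222"},
]
analyst = Policy(frozenset({"customer", "region"}), lambda r: r["region"] == "us")
visible = apply_policy(table, analyst)
```

Combining the two filters yields cell-level control: a user sees only the intersection of permitted rows and permitted columns. For the control to be meaningful, the filtering has to run in code the user cannot modify, which is the motivation for separating system and user drivers.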

Trusted Identity Propagation

  • Integration with AWS IAM Identity Center for end-to-end user identity and access traceability
  • Enables single sign-on and fine-grained permissions enforcement across EMR, Glue, and other services

Netflix's Experience with EMR

  • Netflix has been running a highly customized Spark platform on Hadoop for over 7 years
  • Evaluated EMR to address challenges around security, isolation, operational overhead, and support for specialized hardware
  • Conducted extensive testing, including feature compatibility, performance, scale, and operational complexity
  • Found significant performance improvements, especially for PySpark workloads, and reduced resource consumption
  • Decided to gradually migrate Netflix's Spark workflows to EMR, starting with internal platform workflows and then user-facing workloads
  • Identified areas for further exploration, such as EMR Serverless and AI acceleration capabilities

Key Takeaways

  • AWS continues to drive innovation in data processing services, with a focus on performance, ease of use, and security/governance
  • AI-powered capabilities, such as the Spark upgrade agent, can significantly streamline complex data engineering tasks
  • Iceberg v3 and materialized views provide advanced data lake management capabilities
  • Serverless storage for EMR Spark can improve efficiency and reduce costs
  • SageMaker Notebooks and serverless Airflow simplify data processing workflows and improve user experience
  • Comprehensive security and governance controls, including fine-grained access management, enable enterprises to safely adopt these services
  • Netflix's experience demonstrates the benefits of migrating to EMR, including performance gains and reduced operational overhead
