AWS re:Invent 2025 - Innovations in AWS analytics: Data processing (ANT305)

Summary of AWS re:Invent 2025 - Innovations in AWS Analytics: Data Processing (ANT305)

Introduction

Presenters: Skinuk Bahare (Head of Analytics Portfolio), Neil Mukharji, and Anjali Norbert (Netflix Engineering Manager)
Focus on innovations in data processing services like AWS Glue, EMR, Athena, and Amazon Redshift
Highlights the scale and growth of these services, with billions of queries and jobs executed per week against exabytes of data in Amazon S3

Key Innovations in Data Processing

AI-Powered Spark Upgrade

Challenge: Upgrading Spark runtime versions is difficult due to code and data consistency issues
Solution: AI-powered upgrade agent that automatically generates an upgrade plan, executes it, and validates data quality
Benefits:
- Reduces Spark upgrade time from 6-12 months to minutes
- Automatically handles code changes and data consistency checks
- Provides observability and control over the upgrade process
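
The data consistency check mentioned above can be pictured as comparing the output of a job on the old runtime with its output on the new one. The following is a minimal sketch of that idea only, not the upgrade agent's actual method; the function names and the fingerprinting approach are assumptions made for illustration.

```python
import hashlib
from typing import Iterable, Tuple

Row = Tuple[object, ...]


def table_fingerprint(rows: Iterable[Row]) -> Tuple[int, str]:
    """Return (row_count, checksum) where the checksum ignores row order."""
    count = 0
    combined = 0
    for row in rows:
        digest = hashlib.sha256(repr(row).encode("utf-8")).digest()
        # Summing per-row hashes (mod 2**256) is order-independent and,
        # unlike XOR, does not cancel out duplicated rows.
        combined = (combined + int.from_bytes(digest, "big")) % (1 << 256)
        count += 1
    return count, f"{combined:064x}"


def outputs_match(old_rows: Iterable[Row], new_rows: Iterable[Row]) -> bool:
    """True when both runs produced the same multiset of rows."""
    return table_fingerprint(old_rows) == table_fingerprint(new_rows)


# Hypothetical outputs from the same job on two Spark runtime versions.
before_upgrade = [(1, "a"), (2, "b")]
after_upgrade = [(2, "b"), (1, "a")]
print(outputs_match(before_upgrade, after_upgrade))
```

A real validation would also need to account for floating-point tolerance and schema changes, which this sketch ignores.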

Iceberg v3 Support

Apache Iceberg is an open table format widely used for building data lakes on Amazon S3
EMR Spark runtime 7.12 supports Iceberg v3, which includes features like deletion vectors and row lineage
Enables more efficient data lake management: deletion vectors reduce write amplification and the buildup of small delete files from row-level deletes
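
To make the write amplification point concrete, here is a toy model, assuming a data file is just a list of rows and a deletion vector is a set of deleted row positions. It is a conceptual sketch, not Iceberg's on-disk format or API; `DataFile`, `delete_by_rewrite`, and `delete_by_vector` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class DataFile:
    rows: List[str]
    # Toy "deletion vector": positions of rows that are logically deleted.
    deleted_positions: Set[int] = field(default_factory=set)

    def live_rows(self) -> List[str]:
        return [row for pos, row in enumerate(self.rows)
                if pos not in self.deleted_positions]


def delete_by_rewrite(data_file: DataFile, position: int) -> Tuple[DataFile, int]:
    """Copy-on-write style: rewrite every surviving row into a new file."""
    kept = [row for pos, row in enumerate(data_file.rows)
            if pos != position and pos not in data_file.deleted_positions]
    return DataFile(rows=kept), len(kept)


def delete_by_vector(data_file: DataFile, position: int) -> Tuple[DataFile, int]:
    """Deletion-vector style: mark the position, rewrite no data rows."""
    data_file.deleted_positions.add(position)
    return data_file, 0


original = DataFile(rows=[f"row-{i}" for i in range(1000)])
rewritten, rewritten_rows = delete_by_rewrite(original, 42)
marked, marked_rows = delete_by_vector(DataFile(rows=list(original.rows)), 42)
print(rewritten_rows, marked_rows)  # 999 0
```

Both approaches leave the same live rows; the difference is how much data has to be rewritten to delete a single row.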

Iceberg Materialized Views

Iceberg materialized views are pre-computed Iceberg tables that can be used to speed up queries
Automatically refreshed based on a defined schedule or when new data is available
Integrated into the AWS Glue Data Catalog for easy access from Athena, EMR Spark, and Glue Spark
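
The idea of a pre-computed table that is refreshed when new data arrives can be sketched in plain Python. This is a conceptual model only: the class name, the per-key total it maintains, and the incremental refresh strategy are assumptions for illustration, not how the AWS feature is implemented.

```python
from typing import Dict, List, Tuple


class MaterializedTotals:
    """Toy materialized view: total amount per key, stored pre-computed."""

    def __init__(self) -> None:
        self.base_rows: List[Tuple[str, float]] = []  # the "base table"
        self.view: Dict[str, float] = {}              # the pre-computed result
        self._refreshed_upto = 0                      # rows already folded in

    def append(self, key: str, amount: float) -> None:
        self.base_rows.append((key, amount))

    def is_stale(self) -> bool:
        return self._refreshed_upto < len(self.base_rows)

    def refresh(self) -> None:
        """Fold in only the rows added since the last refresh."""
        for key, amount in self.base_rows[self._refreshed_upto:]:
            self.view[key] = self.view.get(key, 0.0) + amount
        self._refreshed_upto = len(self.base_rows)

    def query(self, key: str) -> float:
        # Queries read the pre-computed result instead of scanning base rows.
        return self.view.get(key, 0.0)


mv = MaterializedTotals()
mv.append("us", 10.0)
mv.append("us", 5.0)
mv.refresh()
print(mv.query("us"))  # 15.0
```

Queries against the view are cheap because the aggregation has already been done; the cost moves to the refresh step, which a schedule or a new-data trigger would invoke.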

Serverless Storage for EMR Spark

Eliminates the need for local disk provisioning for Spark workloads
Offloads shuffle data to a high-performance storage layer, improving Spark scaling and efficiency
Can result in up to 20% cost savings compared to traditional Spark deployments

Ease of Use Innovations

SageMaker Notebooks with AI-Powered Assistance

SageMaker Notebooks provide a unified authoring, execution, and debugging experience for Python and Spark workloads
Leverages Athena for Apache Spark to deliver high-performance Spark capabilities in a serverless environment
Includes an AI agent that can generate SQL, Python, and Spark code, as well as entire notebook plans, based on user prompts and data catalog understanding

Serverless Airflow

Fully managed, serverless deployment of Apache Airflow for data orchestration
Provides workflow-level security and isolation, eliminating the need for separate Airflow environments
Integrated into SageMaker Unified Studio for easy authoring, monitoring, and management of workflows

Security and Governance Innovations

Coarse-Grained Access Control

Amazon S3 Access Grants for controlling read/write permissions to S3 buckets, prefixes, and objects
Catalog-level access control using AWS Lake Formation to grant users access to specific tables
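
As a rough illustration of a table-level grant, the sketch below builds the request for the Lake Formation `grant_permissions` call in boto3. The principal ARN, database, and table names are hypothetical placeholders, and the request is only built here, not sent.

```python
from typing import Any, Dict, List


def build_table_grant(principal_arn: str, database: str, table: str,
                      permissions: List[str]) -> Dict[str, Any]:
    """Build keyword arguments for lakeformation.grant_permissions."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {"Table": {"DatabaseName": database, "Name": table}},
        "Permissions": permissions,
    }


def apply_grant(request: Dict[str, Any]) -> Dict[str, Any]:
    """Send the grant. Requires boto3 and AWS credentials; not called below."""
    import boto3  # imported lazily so the sketch loads without boto3 installed

    client = boto3.client("lakeformation")
    return client.grant_permissions(**request)


# Hypothetical principal and table, for illustration only.
table_grant = build_table_grant(
    principal_arn="arn:aws:iam::111122223333:role/AnalystRole",
    database="sales_db",
    table="orders",
    permissions=["SELECT"],
)
print(table_grant["Resource"])
```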

Fine-Grained Access Control

Column, row, and cell-level security using AWS Lake Formation
Separation of system and user drivers to ensure security controls cannot be bypassed
Supports read and write access control for Iceberg, Delta Lake, and Hudi tables
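
Row- and column-level restrictions in Lake Formation are expressed as data cells filters. The sketch below builds a request for the boto3 `create_data_cells_filter` call; the account ID, table, filter name, and filter expression are hypothetical, and the request is only built, not sent.

```python
from typing import Any, Dict, List


def build_cell_filter(catalog_id: str, database: str, table: str,
                      filter_name: str, row_expression: str,
                      columns: List[str]) -> Dict[str, Any]:
    """Build keyword arguments for lakeformation.create_data_cells_filter."""
    return {
        "TableData": {
            "TableCatalogId": catalog_id,
            "DatabaseName": database,
            "TableName": table,
            "Name": filter_name,
            # Rows visible through the filter.
            "RowFilter": {"FilterExpression": row_expression},
            # Columns visible through the filter.
            "ColumnNames": columns,
        }
    }


def apply_cell_filter(request: Dict[str, Any]) -> Dict[str, Any]:
    """Create the filter. Requires boto3 and AWS credentials; not called below."""
    import boto3  # imported lazily so the sketch loads without boto3 installed

    client = boto3.client("lakeformation")
    return client.create_data_cells_filter(**request)


# Hypothetical filter: analysts see only US rows and two columns.
cell_filter = build_cell_filter(
    catalog_id="111122223333",
    database="sales_db",
    table="orders",
    filter_name="us_orders_only",
    row_expression="country = 'US'",
    columns=["order_id", "amount"],
)
print(cell_filter["TableData"]["RowFilter"])
```

Combining a row filter with a column list is what yields cell-level security: a principal granted access through this filter sees only the intersection of the permitted rows and columns.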

Trusted Identity Propagation

Integration with AWS IAM Identity Center for end-to-end user identity and access traceability
Enables single sign-on and fine-grained permissions enforcement across EMR, Glue, and other services

Netflix's Experience with EMR

Netflix has been running a highly customized Spark platform on Hadoop for over 7 years
Evaluated EMR to address challenges around security, isolation, operational overhead, and support for specialized hardware
Conducted extensive testing, including feature compatibility, performance, scale, and operational complexity
Found significant performance improvements, especially for PySpark workloads, and reduced resource consumption
Decided to gradually migrate Netflix's Spark workflows to EMR, starting with internal platform workflows and then user-facing workloads
Identified areas for further exploration, such as EMR Serverless and AI acceleration capabilities

Key Takeaways

AWS continues to drive innovation in data processing services, with a focus on performance, ease of use, and security/governance
AI-powered capabilities, such as the Spark upgrade agent, can significantly streamline complex data engineering tasks
Iceberg v3 and materialized views provide advanced data lake management capabilities
Serverless storage for EMR Spark can improve efficiency and reduce costs
SageMaker Notebooks and serverless Airflow simplify data processing workflows and improve user experience
Comprehensive security and governance controls, including fine-grained access management, enable enterprises to safely adopt these services
Netflix's experience demonstrates the benefits of migrating to EMR, including performance gains and reduced operational overhead
