AWS re:Invent 2025 - Innovations in AWS analytics: Data processing (ANT305)
Summary of AWS re:Invent 2025 - Innovations in AWS Analytics: Data Processing (ANT305)
Introduction
Presenters: Skinuk Bahare (Head of Analytics Portfolio), Neil Mukharji, and Anjali Norbert (Netflix Engineering Manager)
Focuses on innovations in data processing services such as AWS Glue, Amazon EMR, Amazon Athena, and Amazon Redshift
Highlights the scale and growth of these services, with billions of queries and jobs executed per week against exabytes of data in Amazon S3
Key Innovations in Data Processing
AI-Powered Spark Upgrade
Challenge: Upgrading Spark runtime versions is difficult due to code and data consistency issues
Solution: AI-powered upgrade agent that automatically generates an upgrade plan, executes it, and validates data quality
Benefits:
Reduces Spark upgrade time from 6-12 months to minutes
Automatically handles code changes and data consistency checks
Provides observability and control over the upgrade process
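The data consistency check mentioned above can be illustrated with a small sketch. This is not the upgrade agent's actual method, which the talk summary does not describe; it is a minimal example, assuming job outputs small enough to hold in memory as lists of dicts. The names `row_fingerprint` and `compare_outputs` are hypothetical.

```python
import hashlib
from collections import Counter


def row_fingerprint(row):
    """Hash one row (a dict) so that key order does not matter."""
    canonical = repr(sorted(row.items()))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def compare_outputs(before_rows, after_rows):
    """Compare two job outputs as multisets of rows, ignoring row order."""
    before = Counter(row_fingerprint(r) for r in before_rows)
    after = Counter(row_fingerprint(r) for r in after_rows)
    return {
        "row_count_before": sum(before.values()),
        "row_count_after": sum(after.values()),
        # Rows produced by the old runtime but not by the new one.
        "missing_after_upgrade": sum((before - after).values()),
        # Rows produced only by the new runtime.
        "unexpected_after_upgrade": sum((after - before).values()),
        "consistent": before == after,
    }


# Hypothetical outputs of the same job on the old and new Spark runtime.
old_output = [{"id": 1, "total": 10.0}, {"id": 2, "total": 7.5}]
new_output = [{"id": 2, "total": 7.5}, {"id": 1, "total": 10.0}]
report = compare_outputs(old_output, new_output)
```

Comparing fingerprints as a multiset rather than as ordered lists matters because Spark does not guarantee output row order across runs or versions.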
Iceberg v3 Support
Apache Iceberg is an open table format for building data lakes on Amazon S3
EMR Spark runtime 7.12 supports Iceberg v3, which includes features like deletion vectors and row lineage
Enables more efficient data lake management by reducing write amplification and "smart delete" problems
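To show why deletion vectors reduce write amplification, here is a toy model in plain Python. It is a sketch, not Iceberg's on-disk format: a Python set stands in for the compressed bitmap a real implementation would use, and `DataFile`, `DeletionVector`, and the helper functions are hypothetical names.

```python
class DataFile:
    """Toy immutable data file; rows are addressed by position."""

    def __init__(self, rows):
        self.rows = tuple(rows)


class DeletionVector:
    """Toy deletion vector: the deleted row positions of one data file."""

    def __init__(self):
        self.deleted_positions = set()


def delete_copy_on_write(data_file, predicate):
    """Rewrite the whole file without the matching rows.

    Returns the new file and the number of rows that had to be written.
    """
    kept = [row for row in data_file.rows if not predicate(row)]
    return DataFile(kept), len(kept)


def delete_with_vector(data_file, vector, predicate):
    """Mark matching rows as deleted; the data file itself is untouched.

    Returns the number of positions newly marked.
    """
    marked = 0
    for position, row in enumerate(data_file.rows):
        if predicate(row) and position not in vector.deleted_positions:
            vector.deleted_positions.add(position)
            marked += 1
    return marked


def scan(data_file, vector):
    """Read the file, skipping positions recorded in the deletion vector."""
    return [
        row
        for position, row in enumerate(data_file.rows)
        if position not in vector.deleted_positions
    ]


def is_target(row):
    return row["id"] == 42


data_file = DataFile({"id": i} for i in range(1000))

# Copy-on-write: deleting one row rewrites the other 999.
new_file, rows_rewritten = delete_copy_on_write(data_file, is_target)

# Deletion vector: deleting one row records a single position.
vector = DeletionVector()
positions_marked = delete_with_vector(data_file, vector, is_target)
```

Both approaches return the same rows on read; the difference is how much data a single-row delete forces the writer to produce.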
Iceberg Materialized Views
Iceberg materialized views are pre-computed Iceberg tables that can be used to speed up queries
Automatically refreshed based on a defined schedule or when new data is available
Integrated into the AWS Glue Data Catalog for easy access from Athena, EMR Spark, and Glue Spark
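The idea of a pre-computed result that is refreshed when new data arrives can be sketched in a few lines of Python. This is a conceptual toy, not how Iceberg or the AWS Glue Data Catalog implement materialized views; the class `ToyMaterializedView` and its row-count staleness check are assumptions made for illustration.

```python
from collections import defaultdict


class ToyMaterializedView:
    """Pre-computed total of `amount` per `region` over a base table."""

    def __init__(self, base_rows):
        self.base_rows = base_rows  # list standing in for the base table
        self.rows_at_refresh = 0
        self.totals = {}
        self.refresh()

    def is_stale(self):
        """True when the base table has grown since the last refresh."""
        return len(self.base_rows) != self.rows_at_refresh

    def refresh(self):
        """Recompute the aggregate from the base table (full refresh)."""
        totals = defaultdict(float)
        for row in self.base_rows:
            totals[row["region"]] += row["amount"]
        self.totals = dict(totals)
        self.rows_at_refresh = len(self.base_rows)

    def query(self, region):
        """Answer from the pre-computed result without scanning the base table."""
        return self.totals.get(region, 0.0)


orders = [
    {"region": "eu", "amount": 10.0},
    {"region": "us", "amount": 5.0},
    {"region": "eu", "amount": 5.0},
]
view = ToyMaterializedView(orders)
eu_before = view.query("eu")

# New data arrives; the view serves the old result until it is refreshed.
orders.append({"region": "eu", "amount": 20.0})
stale = view.is_stale()
view.refresh()
eu_after = view.query("eu")
```

Queries read the stored totals instead of re-aggregating the base rows, which is where the speed-up comes from; the cost is that results lag the base table until the next refresh.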
Serverless Storage for EMR Spark
Eliminates the need for local disk provisioning for Spark workloads
Offloads shuffle data to a high-performance storage layer, improving Spark scaling and efficiency
Can result in up to 20% cost savings compared to traditional Spark deployments
Ease of Use Innovations
SageMaker Notebooks with AI-Powered Assistance
SageMaker Notebooks provide a unified authoring, execution, and debugging experience for Python and Spark workloads
Uses Amazon Athena for Apache Spark to deliver high-performance Spark capabilities in a serverless environment
Includes an AI agent that can generate SQL, Python, and Spark code, as well as entire notebook plans, based on user prompts and data catalog understanding
Serverless Airflow
Fully managed, serverless deployment of Apache Airflow for data orchestration
Provides workflow-level security and isolation, eliminating the need for separate Airflow environments
Integrated into SageMaker Unified Studio for easy authoring, monitoring, and management of workflows
Security and Governance Innovations
Coarse-Grained Access Control
S3 Access Grants for controlling read/write permissions to S3 buckets, prefixes, and objects
Catalog-level access control using AWS Lake Formation to grant users access to specific tables
Fine-Grained Access Control
Column, row, and cell-level security using AWS Lake Formation
Separation of system and user drivers to ensure security controls cannot be bypassed
Supports read and write access control for Iceberg, Delta Lake, and Hudi tables
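The effect of column-, row-, and cell-level security can be illustrated with a toy filter in plain Python. This is a sketch of the concept only; it does not call AWS Lake Formation, and `apply_data_filter` is a hypothetical helper. It assumes cell-level restriction means applying a column list and a row predicate together.

```python
def apply_data_filter(rows, allowed_columns, row_predicate):
    """Return only the permitted rows, projected to the permitted columns.

    Applying both restrictions at once limits access to individual cells:
    a value is visible only if its row passes the predicate AND its
    column is in the allowed list.
    """
    return [
        {column: row[column] for column in allowed_columns}
        for row in rows
        if row_predicate(row)
    ]


customers = [
    {"id": 1, "country": "US", "email": "a@example.com", "spend": 120},
    {"id": 2, "country": "DE", "email": "b@example.com", "spend": 80},
    {"id": 3, "country": "US", "email": "c@example.com", "spend": 45},
]

# Hypothetical analyst policy: US rows only, and no access to `email`.
analyst_view = apply_data_filter(
    customers,
    allowed_columns=["id", "country", "spend"],
    row_predicate=lambda row: row["country"] == "US",
)
```

In a real deployment the filter has to run where user code cannot reach the unfiltered rows, which is the purpose of the separation of system and user drivers noted above.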
Trusted Identity Propagation
Integration with AWS IAM Identity Center for end-to-end user identity and access traceability
Enables single sign-on and fine-grained permissions enforcement across EMR, Glue, and other services
Netflix's Experience with EMR
Netflix has been running a highly customized Spark platform on Hadoop for over 7 years
Evaluated EMR to address challenges around security, isolation, operational overhead, and support for specialized hardware
Conducted extensive testing, including feature compatibility, performance, scale, and operational complexity
Found significant performance improvements, especially for PySpark workloads, and reduced resource consumption
Decided to gradually migrate Netflix's Spark workflows to EMR, starting with internal platform workflows and then user-facing workloads
Identified areas for further exploration, such as EMR Serverless and AI acceleration capabilities
Key Takeaways
AWS continues to drive innovation in data processing services, with a focus on performance, ease of use, and security/governance
AI-powered capabilities, such as the Spark upgrade agent, can significantly streamline complex data engineering tasks
Iceberg v3 and materialized views provide advanced data lake management capabilities
Serverless storage for EMR Spark can improve efficiency and reduce costs
SageMaker Notebooks and serverless Airflow simplify data processing workflows and improve user experience
Comprehensive security and governance controls, including fine-grained access management, enable enterprises to safely adopt these services
Netflix's experience demonstrates the benefits of migrating to EMR, including performance gains and reduced operational overhead