Efficient incremental processing with Apache Iceberg at Netflix (NFX303)

Introduction

Netflix is a data-driven company, where data insights derived from various data pipelines power critical business decisions.

As the business expands to new domains like gaming, ads, and live streaming, the demand for data continues to grow, leading to common challenges of data accuracy, freshness, and cost efficiency.

One major challenge is late-arriving data: events generated at a particular time may not be processed until hours or days later, which undermines accuracy, freshness, and cost efficiency alike.

Architectural Design

To address these challenges, Netflix developed a solution using Apache Iceberg and Netflix Maestro.

Iceberg is a high-performance open table format for large analytic tables; its rich metadata makes it possible to track data changes without reading the user data.

Maestro is a horizontally scalable workflow orchestrator, providing a well-paved path for users to develop and manage their workflows.

Incremental Change Capturing

Leveraging Iceberg's metadata, Netflix built a zero-data-copy incremental change capturing table that contains only the references to the changed data files, without copying the actual data.

This approach ensures data accuracy, security, and privacy compliance, while being efficient.
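
The idea of capturing changes by reference can be sketched with a toy model. This is not Iceberg's API: the `Snapshot` class, the `sequence` field, and the file paths are hypothetical stand-ins for the snapshot and manifest metadata that Iceberg maintains. The point it illustrates is that the change set is built from metadata alone, so no data file is opened or copied.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical stand-in for one table snapshot's metadata."""
    sequence: int                 # assumed to increase with every commit
    added_files: Tuple[str, ...]  # paths of data files added by this commit


def capture_changes(history: List[Snapshot], after: int, up_to: int) -> List[str]:
    """Collect references to files added in the range (after, up_to].

    Only metadata is inspected; the files themselves are never read.
    """
    refs: List[str] = []
    for snap in history:
        if after < snap.sequence <= up_to:
            refs.extend(snap.added_files)
    return refs


# Hypothetical commit history of a source table.
history = [
    Snapshot(1, ("s3://example-bucket/events/a.parquet",)),
    Snapshot(2, ("s3://example-bucket/events/b.parquet",
                 "s3://example-bucket/events/c.parquet")),
    Snapshot(3, ("s3://example-bucket/events/d.parquet",)),
]

# Everything committed after snapshot 1, up to and including snapshot 3.
changed_files = capture_changes(history, after=1, up_to=3)
```

In this sketch the "incremental change table" is simply the list `changed_files`: it points at the same files the source table already owns.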

User Onboarding Experience

To provide a seamless onboarding experience for users, Netflix decoupled the change capturing from the business logic, introducing two new Maestro step types: IpCapture and IpCommit.

Users can simply add these steps to their existing workflows, enabling incremental processing support with minimal code changes.
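
A rough sketch of how a capture step and a commit step might bracket existing business logic is shown below. It does not reproduce Maestro's step definitions: the function names, the `watermark` checkpoint, and the assumption that the commit only advances the checkpoint after the job succeeds are all illustrative.

```python
def ip_capture(state, latest):
    """Capture step (hypothetical): decide which range of changes this run covers."""
    return {"from_exclusive": state["watermark"], "to_inclusive": latest}


def business_logic(captured):
    """Stand-in for the user's existing ETL step; returns how many commits it covered."""
    return captured["to_inclusive"] - captured["from_exclusive"]


def ip_commit(state, captured):
    """Commit step (hypothetical): mark the captured range as processed."""
    state["watermark"] = captured["to_inclusive"]


def run_workflow(state, latest, job=business_logic):
    captured = ip_capture(state, latest)
    result = job(captured)        # if this raises, the commit below is skipped
    ip_commit(state, captured)
    return result


# Assumed checkpoint: sequence number of the last processed source commit.
state = {"watermark": 0}
processed = run_workflow(state, latest=5)
```

Because the checkpoint moves only in `ip_commit`, a failed run leaves it untouched and the next run captures the same range again. The business logic itself needs no knowledge of how the range was determined.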

Use Cases and Examples

Incremental Processing and Appending to Target Table:

Users can process the change data from the incremental change table and append it to the target table, avoiding full reprocessing.
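
As a minimal illustration, assuming rows are plain dictionaries and the transformation is a simple unit conversion (both hypothetical), only the changed rows pass through the transformation before being appended:

```python
def transform(rows):
    """Hypothetical business logic: convert watch time from seconds to whole minutes."""
    return [{"user": r["user"], "minutes": r["seconds"] // 60} for r in rows]


# Target table as it stood after the previous run.
target = [{"user": "a", "minutes": 10}]

# Rows exposed by the incremental change table since the previous run.
change_rows = [
    {"user": "b", "seconds": 300},
    {"user": "c", "seconds": 125},
]

# Append only the newly processed rows; existing target rows are not recomputed.
target.extend(transform(change_rows))
```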

Change Data as a Filter to Reduce Transformation:

Users can use the change data to efficiently filter the original table and only process the necessary data for complex transformations.
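
The sketch below assumes a per-key aggregation, where a late row for a key means that key's total must be recomputed from the full source table. The table contents and key names are made up; the pattern is that the change data supplies the set of affected keys, and untouched keys are skipped.

```python
from collections import defaultdict

# Full source table as (key, value) rows, including the newly arrived rows.
source = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]

# Rows exposed by the incremental change table for this run.
changes = [("a", 3), ("b", 5)]

# Step 1: the change data tells us which keys were touched.
changed_keys = {key for key, _ in changes}

# Step 2: recompute the aggregate only for those keys.
totals = defaultdict(int)
for key, value in source:
    if key in changed_keys:
        totals[key] += value
```

Key `c` received no new data, so its rows are filtered out before the aggregation runs.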

Captured Range Parameters in Business Logic:

Users can leverage the captured range information from the incremental change table to precisely define the data they need to process, without guesswork.
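
One way to picture this, assuming the change rows carry an `event_date` column and the downstream job is parameterized by a date range (the column name, table name, and query text are hypothetical):

```python
from datetime import date

# Rows exposed by the incremental change table; late arrivals span several days.
changes = [
    {"event_date": date(2023, 11, 27), "user": "a"},
    {"event_date": date(2023, 11, 25), "user": "b"},
    {"event_date": date(2023, 11, 28), "user": "c"},
]


def captured_range(rows, column):
    """Return the smallest and largest value of `column` among the changed rows."""
    values = [row[column] for row in rows]
    return min(values), max(values)


start, end = captured_range(changes, "event_date")

# The range is passed to the business logic instead of a guessed lookback window.
query = (
    "SELECT * FROM playback_events "
    f"WHERE event_date BETWEEN '{start}' AND '{end}'"
)
```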

Auto Remediation Pattern:

Maestro's conditional branching allows users to implement an auto-remediation workflow, where failed jobs are automatically retried, reducing manual intervention.
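
The control flow can be approximated in plain Python as follows. This is not Maestro syntax: `run_with_remediation`, the remediation callback, and the attempt limit are assumptions used to show the branch taken on failure.

```python
def run_with_remediation(job, remediate, max_attempts=3):
    """Run `job`; on failure, branch to `remediate` and retry, up to `max_attempts`."""
    for attempt in range(1, max_attempts + 1):
        try:
            return job(), attempt
        except Exception as error:
            if attempt == max_attempts:
                raise             # give up and surface the failure to a human
            remediate(error)      # failure branch: repair, then loop to retry


# Hypothetical job that fails until the remediation step has run once.
run_state = {"calls": 0, "fixed": False}


def flaky_job():
    run_state["calls"] += 1
    if not run_state["fixed"]:
        raise RuntimeError("audit check failed")
    return "ok"


def fix(error):
    run_state["fixed"] = True


result, attempts_used = run_with_remediation(flaky_job, fix)
```

Bounding the number of attempts keeps a job that cannot be repaired automatically from retrying forever.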

Key Takeaways and Future Work

The incremental processing solution using Iceberg and Maestro enables new patterns and use cases, addressing data accuracy, freshness, and cost efficiency challenges.

The zero-data-copy approach and the clean interfaces provide a seamless onboarding experience for users, with minimal code changes.

Future work includes adding SQL extension support for incremental change views in Iceberg, supporting different types of snapshots, and implementing multi-stage auto-cascading data backfill support.
