Efficient incremental processing with Apache Iceberg at Netflix (NFX303)
Efficient Incremental Processing with Apache Iceberg and Netflix Maestro
Introduction
Netflix is a data-driven company, where data insights derived from various data pipelines power critical business decisions.
As the business expands to new domains like gaming, ads, and live streaming, the demand for data continues to grow, leading to common challenges of data accuracy, freshness, and cost efficiency.
One major challenge is dealing with later arriving data, where events generated at a particular time are processed hours or days later, leading to data accuracy, freshness, and cost efficiency issues.
Architectural Design
To address these challenges, Netflix developed a solution using Apache Iceberg and Netflix Maestro.
Iceberg is a high-performance format for large analytics tables, providing rich metadata to track data changes without reading the user data.
Maestro is a horizontally scalable workflow orchestrator, providing a well-paved path for users to develop and manage their workflows.
Incremental Change Capturing
Leveraging Iceberg's metadata, Netflix built a zero-data-copy incremental change capturing table that contains only the references to the changed data files, without copying the actual data.
This approach ensures data accuracy, security, and privacy compliance, while being efficient.
User Onboarding Experience
To provide a seamless onboarding experience for users, Netflix decoupled the change capturing from the business logic, introducing two new Maestro step types: IpCapture and IpCommit.
Users can simply add these steps to their existing workflows, enabling incremental processing support with minimal code changes.
Use Cases and Examples
Incremental Processing and Appending to Target Table:
Users can process the change data from the incremental change table and append it to the target table, avoiding full reprocessing.
Change Data as a Filter to Reduce Transformation:
Users can use the change data to efficiently filter the original table and only process the necessary data for complex transformations.
Captured Range Parameters in Business Logic:
Users can leverage the captured range information from the incremental change table to precisely define the data they need to process, without guesswork.
Auto Remediation Pattern:
Maestro's conditional branching allows users to implement an auto-remediation workflow, where failed jobs are automatically retried, reducing manual intervention.
Key Takeaways and Future Work
The incremental processing solution using Iceberg and Maestro enables new patterns and use cases, addressing data accuracy, freshness, and cost efficiency challenges.
The zero-data-copy approach and the clean interfaces provide a seamless onboarding experience for users, with minimal code changes.
Future work includes adding SQL extension support for incremental change views in Iceberg, supporting different types of snapshots, and implementing multi-stage auto-cascading data backfill support.
Acknowledgments
The presenter acknowledges the incredible teamwork of engineers and leaders at Netflix, as well as the AWS team for hosting the event.
These cookies are used to collect information about how you interact with this website and allow us to remember you. We use this information to improve and customize your browsing experience, as well as for analytics.
If you decline, your information won’t be tracked when you visit this website. A single cookie will be used in your browser to remember your preference.