Scaling Pinterest: Iceberg Solutions for Petabyte-Scale Challenges
Pinterest's Data Scale and Challenges
Pinterest has over 600 million monthly active users and saves around 1.5 billion pins per week, resulting in a 500 petabyte data lake on Amazon S3.
Pinterest's data infrastructure includes over 100,000 Hive and Iceberg tables, 20,000 Spark nodes, and 1,000 Trino (formerly Presto) nodes, running 400,000 compute jobs per day.
The traditional Hive table format was no longer sufficient to handle Pinterest's growing data and use case requirements, leading them to explore alternatives like Apache Iceberg.
Adopting Apache Iceberg at Pinterest
Pinterest's transition to Iceberg took around 2 years, with the first production use case being user data deletion in 2023.
Currently, Pinterest has around 15,000 Iceberg tables holding 200 petabytes of data; the Iceberg table count has grown by over 300%, while data volume growth has been more controlled.
Pinterest enables Iceberg usage across engines including Trino, Spark, Flink, and Python clients, supporting everything from metadata-only reads to batch processing frameworks.
Key Use Cases Powered by Iceberg
User Data Deletion:
In the Hive world, user data deletion required rewriting entire tables, which was expensive and unreliable.
Iceberg enabled selective file-level deletions, leveraging snapshot isolation to ensure ongoing queries are unaffected.
Further optimizations, including sorting tables by the deletion key and contributing changes upstream to Iceberg and Spark, allowed Pinterest to scale its deletion capability by 10x, cut data storage costs by 30%, cut compute costs by 30%, and improve reliability by 90%.
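The interplay between sorting and selective deletion can be illustrated with a minimal sketch (this is not Iceberg's actual code; the manifest structure here is a simplification of Iceberg's per-file column statistics): when a table is sorted by the deletion key, each data file covers a narrow key range, so min/max pruning leaves only a handful of files to rewrite.

```python
# Illustrative sketch: with per-file min/max statistics on the deletion key,
# only files whose key range contains the target user need to be rewritten.

def files_to_rewrite(manifest, user_id):
    """manifest: list of (file_path, min_key, max_key) entries."""
    return [path for path, lo, hi in manifest if lo <= user_id <= hi]

# Table sorted by user_id: each file covers a narrow key range,
# so a deletion touches very few files.
sorted_manifest = [
    ("f0.parquet", 0, 99),
    ("f1.parquet", 100, 199),
    ("f2.parquet", 200, 299),
]
print(files_to_rewrite(sorted_manifest, 150))  # ['f1.parquet']

# Unsorted table: every file may span the whole key space,
# so every file must be rewritten.
unsorted_manifest = [("g0.parquet", 0, 299), ("g1.parquet", 0, 299)]
print(files_to_rewrite(unsorted_manifest, 150))  # both files
```

Snapshot isolation then lets readers keep using the old file list while the rewritten files are committed as a new snapshot.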
Table Sampling:
Pinterest's machine learning engineers and data scientists required reproducible sampling for data exploration and joining large tables.
Iceberg's bucket-based sampling approach allowed Pinterest to guarantee the same keys are present in sampled outputs from multiple tables, enabling reproducible and consistent sampling.
This resulted in 90% speed-ups for users and less than 1% deviation from full table scans.
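The key property of bucket-based sampling is that membership depends only on a hash of the key, so the same keys survive in every table. A small sketch (Iceberg's bucket transform uses Murmur3; SHA-256 here is a stand-in, and the bucket count is illustrative):

```python
import hashlib

N_BUCKETS = 16

def bucket(key: str) -> int:
    # Deterministic hash of the key into one of N_BUCKETS buckets.
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:4], "big") % N_BUCKETS

def sample(rows, key_fn):
    # Keep only rows whose key lands in bucket 0 (~1/16 of the data).
    return [r for r in rows if bucket(key_fn(r)) == 0]

users = [{"user": f"u{i}"} for i in range(1000)]
events = [{"user": f"u{i}", "event": "click"} for i in range(1000)]

sampled_users = sample(users, lambda r: r["user"])
sampled_events = sample(events, lambda r: r["user"])

# Because bucketing depends only on the key, both samples contain exactly
# the same user IDs, so joins on the sampled tables remain consistent.
assert {r["user"] for r in sampled_users} == {r["user"] for r in sampled_events}
```

Random row-level sampling would not give this guarantee: independently sampled tables would share almost no keys, making joined samples useless.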
Feature Backfills:
Pinterest's machine learning models rely on a wide range of features, and engineers constantly experiment with new features.
The traditional approach of forward logging new features was expensive due to the high cost of joins on petabyte-scale data.
Iceberg's storage-partitioned joins allowed Pinterest to enable feature backfilling, resulting in 90x faster feature development and 65% cost savings on large joins.
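The idea behind a storage-partitioned join can be sketched in plain Python (an illustrative simulation, not Spark's implementation): when both tables are bucketed identically on the join key, matching keys are guaranteed to land in the same bucket on both sides, so the join proceeds bucket-by-bucket with no shuffle.

```python
N = 4  # bucket count; stand-in for Iceberg's bucket partition transform

def bucketize(rows, key_fn):
    # Partition rows into N buckets by join key.
    buckets = [[] for _ in range(N)]
    for r in rows:
        buckets[key_fn(r) % N].append(r)
    return buckets

def partitioned_join(left_buckets, right_buckets, key_fn):
    # Join only co-located bucket pairs; no cross-bucket data movement.
    out = []
    for lb, rb in zip(left_buckets, right_buckets):
        index = {}
        for r in rb:
            index.setdefault(key_fn(r), []).append(r)
        for l in lb:
            for r in index.get(key_fn(l), []):
                out.append({**l, **r})
    return out

pins = [{"pin": p, "user": p % 10} for p in range(20)]
features = [{"user": u, "feat": u * 1.0} for u in range(10)]

joined = partitioned_join(
    bucketize(pins, lambda r: r["user"]),
    bucketize(features, lambda r: r["user"]),
    lambda r: r["user"],
)
```

At petabyte scale, eliminating the shuffle step is where the bulk of the join savings comes from.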
Lessons Learned from Operating Iceberg at Scale on Amazon S3
User Agent-Based Access Control:
To prevent accidental writes to Iceberg data sets, Pinterest implemented user agent-based access control policies in Amazon S3, allowing only Iceberg-capable clients to modify the data.
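Such a guard might look like the following bucket policy sketch (bucket name and user-agent pattern are hypothetical; `aws:UserAgent` is a real AWS condition key, but since it is client-supplied it should be treated as a guardrail against accidents, not a security boundary):

```python
import json

# Hypothetical S3 bucket policy: deny writes and deletes under the Iceberg
# warehouse prefix unless the request's User-Agent matches an Iceberg-capable
# client. Bucket name and pattern are illustrative.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyNonIcebergWrites",
            "Effect": "Deny",
            "Principal": "*",
            "Action": ["s3:PutObject", "s3:DeleteObject"],
            "Resource": "arn:aws:s3:::example-iceberg-lake/warehouse/*",
            "Condition": {
                "StringNotLike": {"aws:UserAgent": "*Iceberg*"}
            },
        }
    ],
}

print(json.dumps(policy, indent=2))
```

The deny-unless pattern means legacy Hive writers or ad hoc scripts fail fast instead of silently corrupting Iceberg metadata.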
Leveraging S3 Request Logs and Inventory:
Pinterest used S3 request logs and inventory reports to identify and remove orphaned, unreferenced files in their Iceberg data sets, ensuring efficient storage utilization.
They also used these tools to monitor and validate legitimate access patterns to their Iceberg tables.
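Conceptually, orphan detection is a set difference between what S3 inventory says exists and what Iceberg metadata references. A minimal sketch (paths are illustrative; real pipelines add a grace period so in-flight writes are not deleted):

```python
def find_orphans(inventory_keys, referenced_files, grace_keys=frozenset()):
    # Orphan candidates: objects present in the S3 inventory report but not
    # referenced by any Iceberg snapshot, minus any grace-period exclusions.
    return set(inventory_keys) - set(referenced_files) - set(grace_keys)

inventory = {"t/data/a.parquet", "t/data/b.parquet", "t/data/stale.parquet"}
referenced = {"t/data/a.parquet", "t/data/b.parquet"}

orphans = find_orphans(inventory, referenced)  # {'t/data/stale.parquet'}
```

At Pinterest's scale, inventory reports make this feasible without issuing billions of LIST requests against the live bucket.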
Addressing S3 Throttling:
During in-place Hive to Iceberg migrations, Pinterest encountered S3 throttling issues due to the static nature of their object paths.
By adopting Amazon's recommended approach of introducing entropy early in object paths, using Iceberg's object-storage layout to inject a 20-bit base-2 hash into file paths, Pinterest was able to eliminate user complaints about S3 throttling.
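The mechanism can be sketched as follows (an illustrative layout, not Iceberg's exact path scheme; Iceberg derives the hash from the data file's path, and SHA-256 stands in for its hash function): a short binary hash segment near the front of the key spreads objects across many S3 key prefixes instead of concentrating them under one static, hot prefix.

```python
import hashlib

def entropy_key(table_prefix: str, relative_path: str, bits: int = 20) -> str:
    # Hash the relative file path and render the low `bits` as a base-2
    # string, placed early in the key to distribute S3 partition load.
    h = int.from_bytes(hashlib.sha256(relative_path.encode()).digest()[:4], "big")
    prefix = format(h % (1 << bits), f"0{bits}b")  # e.g. '0101...0111' (20 chars)
    return f"{table_prefix}/{prefix}/{relative_path}"

key = entropy_key("s3://lake/warehouse/events", "data/part-00001.parquet")
```

Because S3 partitions request capacity by key prefix, the hash segment lets S3 split a busy table across many partitions rather than throttling one static path.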
Key Takeaways
Iceberg enabled Pinterest to address their growing data scale and use case challenges, providing significant benefits in user data deletion, table sampling, and feature backfilling.
Careful planning, optimizations, and contributions to the Iceberg ecosystem were crucial for Pinterest to successfully operate Iceberg at their massive scale on Amazon S3.
Leveraging Amazon S3 features like access control, request logs, and inventory reports was instrumental in managing Iceberg data sets effectively.
The lessons learned by Pinterest can serve as valuable guidance for other organizations facing similar petabyte-scale data challenges and considering the adoption of Iceberg.