AWS re:Invent 2025 - Inside S3: Lessons from exabyte-scale data lake modernization (STG351)
Modernizing S3's Exabyte-Scale Data Lake
Challenges of Massive Data Volumes
S3 serves over 1 million requests per second, generating massive amounts of internal log data
This data represents critical customer interactions and business insights, but is difficult to access and analyze at scale
Key challenges include:
Engineers spending more time finding data than analyzing it
Slow issue resolution due to difficulty accessing relevant logs
Valuable insights being buried in the "Mount Everest" of data
Improving Data Discoverability and Accessibility
Developed a centralized data catalog to make internal data sets more discoverable
Focused on ensuring data entries are useful and users are incentivized to maintain quality
Positioned the team as brokers of tooling rather than centralized data owners
Optimized the query engine with techniques like predicate and projection push-down
Allowed queries to scan less data and return only relevant results
Dramatically improved query performance and reduced resource consumption
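The push-down ideas above can be sketched in a toy form: skip whole row groups using per-column min/max statistics (predicate push-down) and materialize only the requested columns (projection push-down). This is an illustration, not S3's actual engine; the column names and row-group layout are assumptions.

```python
# Toy predicate + projection push-down over columnar row groups.
# RowGroup, the column names, and the stats layout are all hypothetical.
from dataclasses import dataclass, field

@dataclass
class RowGroup:
    columns: dict                               # column name -> list of values
    stats: dict = field(default_factory=dict)   # column name -> (min, max)

    def __post_init__(self):
        # Keep min/max stats only for numeric columns.
        self.stats = {c: (min(v), max(v)) for c, v in self.columns.items()
                      if v and isinstance(v[0], (int, float))}

def scan(groups, wanted_cols, predicate_col, threshold):
    """Return wanted_cols from rows where predicate_col > threshold,
    skipping row groups whose max value cannot satisfy the predicate."""
    out = []
    for g in groups:
        _, hi = g.stats.get(predicate_col, (None, None))
        if hi is not None and hi <= threshold:
            continue  # predicate push-down: skip the entire row group
        for i, val in enumerate(g.columns[predicate_col]):
            if val > threshold:
                # projection push-down: materialize only requested columns
                out.append({c: g.columns[c][i] for c in wanted_cols})
    return out
```

Real engines apply the same two ideas inside Parquet readers, where row-group statistics live in the file footer.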
Transforming the Data Layout with Apache Iceberg
Recognized limitations of text-based log formats and need for a more modern data platform
Chose Apache Iceberg to provide:
Transactional updates and data consistency
Schema and partition evolution
Time travel capabilities
Designed a three-layer schema:
Identity layer: Core identifiers like request ID, timestamps, etc.
Measurements and counters: Numerical data about request behavior and performance
Context layer: Additional debug and service-specific information
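The three-layer design above might look something like the following sketch; the field names and types are illustrative guesses, not S3's actual columns.

```python
# Hedged sketch of a three-layer log schema (identity / measurements /
# context). All field names and types here are assumptions for illustration.
LOG_SCHEMA = {
    "identity": {              # core identifiers for each request
        "request_id": "string",
        "timestamp": "timestamp",
        "operation": "string",
    },
    "measurements": {          # numeric data about behavior and performance
        "latency_ms": "long",
        "bytes_sent": "long",
        "cache_hit": "boolean",
    },
    "context": {               # debug and service-specific information
        "host": "string",
        "debug_tags": "map<string,string>",
    },
}

def flatten(schema):
    """Flatten the layered schema into (column, type) pairs for a table DDL."""
    return [(f"{layer}.{name}", typ)
            for layer, cols in schema.items()
            for name, typ in cols.items()]
```

Layering keeps stable identifiers and hot numeric columns separate from sparse, evolving debug fields, which suits Iceberg's schema evolution.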
Efficient Data Migration and Ingestion
Developed a custom "transcoder" tool to convert existing text logs into optimized Parquet format
Maintained compatibility with existing log agents
Gradually rolled out changes with a focus on safety and reliability
Planned for a future where log agents would write directly to aggregators, which would then ingest into Iceberg
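The transcoding step can be sketched as parsing delimited text lines into typed, column-oriented records ready for Parquet encoding. The space-delimited format and field names below are assumptions, not S3's real log layout.

```python
# Minimal "transcoder" sketch: raw text log lines -> columnar dict-of-lists.
# The log format and fields are invented for illustration.
import csv
import io

FIELDS = [("request_id", str), ("timestamp", str),
          ("operation", str), ("latency_ms", int)]

def transcode(text_log: str):
    """Parse space-delimited log lines into typed columns."""
    columns = {name: [] for name, _ in FIELDS}
    for row in csv.reader(io.StringIO(text_log), delimiter=" "):
        if len(row) != len(FIELDS):
            continue  # skip malformed lines instead of failing the batch
        for (name, cast), value in zip(FIELDS, row):
            columns[name].append(cast(value))
    return columns
```

A real transcoder would hand the resulting columns to a Parquet writer (e.g. pyarrow) rather than keep them in memory.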
Key Takeaways and Lessons Learned
Identify high-value but high-barrier internal data sets and focus on improving accessibility
Work backwards from user needs and questions, not just current capabilities
Meet systems and users where they are, minimizing disruption to existing workflows
Leverage technologies like Iceberg to provide a modern, scalable data foundation
Invest in efficient data migration and ingestion strategies to enable the transition
Business Impact
Returned thousands of engineering hours to building and operating core services
Enabled engineers to run arbitrary queries on fresh data within minutes
Empowered product managers and data scientists to access and analyze historical data much faster
Unlocked valuable insights and business intelligence previously buried in the data
Technical Details and Metrics
S3 receives over 1 million requests per second, generating terabytes to pebibytes of log data per hour
Custom "transcoder" tool converted 1 hour of logs to Parquet in just 3 minutes
Iceberg-based data layout allowed queries to skip irrelevant data, delivering results in minutes instead of hours
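The data-skipping behavior described above relies on partition metadata: the query planner prunes files whose partition values fall outside the queried range before reading any data. A toy version, with invented file paths and partition values:

```python
# Toy metadata-based file pruning. Paths and hour partitions are invented;
# Iceberg tracks equivalent metadata in its manifest files.
FILES = [
    {"path": "logs/hour=2025-06-01-00/part-0.parquet", "hour": 0},
    {"path": "logs/hour=2025-06-01-01/part-0.parquet", "hour": 1},
    {"path": "logs/hour=2025-06-01-02/part-0.parquet", "hour": 2},
]

def prune(files, hour_min, hour_max):
    """Return only the files whose partition falls inside the queried range."""
    return [f["path"] for f in files if hour_min <= f["hour"] <= hour_max]
```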
Examples and Use Cases
Analyzing trends in feature usage by customer segment over 3 months
Identifying requests that hit the primary cache but took longer than 50ms to return
Tracking performance and failure rates of specific hardware components over time
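The second use case above (cache hits slower than 50 ms) reduces to a simple filter once the logs are structured; the record fields here are illustrative assumptions, not S3's schema.

```python
# Hedged sketch of the "slow cache hit" use case: requests that hit the
# primary cache but still took longer than 50 ms. Field names are assumed.
def slow_cache_hits(records, threshold_ms=50):
    return [r["request_id"] for r in records
            if r["cache_hit"] and r["latency_ms"] > threshold_ms]
```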