AWS re:Invent 2025 - Inside S3: Lessons from exabyte-scale data lake modernization (STG351)
Modernizing S3's Exabyte-Scale Data Lake
Challenges of Massive Data Volumes
S3 serves over 1 million requests per second, generating massive amounts of internal log data
This data represents critical customer interactions and business insights, but is difficult to access and analyze at scale
Key challenges include:
Engineers spending more time finding data than analyzing it
Slow issue resolution due to difficulty accessing relevant logs
Valuable insights being buried in the "Mount Everest" of data
Improving Data Discoverability and Accessibility
Developed a centralized data catalog to make internal data sets more discoverable
Focused on ensuring data entries are useful and users are incentivized to maintain quality
Positioned the team as brokers of tooling rather than centralized data owners
Optimized the query engine with techniques like predicate and projection push-down
Allowed queries to scan less data and return only relevant results
Dramatically improved query performance and reduced resource consumption
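The push-down ideas above can be sketched in a toy form: skip whole row groups using per-column min/max statistics (predicate push-down) and materialize only the requested columns (projection push-down). This is an illustration, not S3's actual engine; the column names and row-group layout are assumptions.

```python
# Toy predicate + projection push-down over columnar row groups.
# RowGroup, the column names, and the stats layout are all hypothetical.
from dataclasses import dataclass, field

@dataclass
class RowGroup:
    columns: dict                               # column name -> list of values
    stats: dict = field(default_factory=dict)   # column name -> (min, max)

    def __post_init__(self):
        # Keep min/max stats only for numeric columns.
        self.stats = {c: (min(v), max(v)) for c, v in self.columns.items()
                      if v and isinstance(v[0], (int, float))}

def scan(groups, wanted_cols, predicate_col, threshold):
    """Return wanted_cols from rows where predicate_col > threshold,
    skipping row groups whose max value cannot satisfy the predicate."""
    out = []
    for g in groups:
        _, hi = g.stats.get(predicate_col, (None, None))
        if hi is not None and hi <= threshold:
            continue  # predicate push-down: skip the entire row group
        for i, val in enumerate(g.columns[predicate_col]):
            if val > threshold:
                # projection push-down: materialize only requested columns
                out.append({c: g.columns[c][i] for c in wanted_cols})
    return out
```

Real engines apply the same two ideas inside Parquet readers, where row-group statistics live in the file footer.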
Transforming the Data Layout with Apache Iceberg
Recognized limitations of text-based log formats and need for a more modern data platform
Chose Apache Iceberg to provide:
Transactional updates and data consistency
Schema and partition evolution
Time travel capabilities
Designed a three-layer schema:
Identity layer: Core identifiers like request ID, timestamps, etc.
Measurements and counters: Numerical data about request behavior and performance
Context layer: Additional debug and service-specific information
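The three-layer design above might look something like the following sketch; the field names and types are illustrative guesses, not S3's actual columns.

```python
# Hedged sketch of a three-layer log schema (identity / measurements /
# context). All field names and types here are assumptions for illustration.
LOG_SCHEMA = {
    "identity": {              # core identifiers for each request
        "request_id": "string",
        "timestamp": "timestamp",
        "operation": "string",
    },
    "measurements": {          # numeric data about behavior and performance
        "latency_ms": "long",
        "bytes_sent": "long",
        "cache_hit": "boolean",
    },
    "context": {               # debug and service-specific information
        "host": "string",
        "debug_tags": "map<string,string>",
    },
}

def flatten(schema):
    """Flatten the layered schema into (column, type) pairs for a table DDL."""
    return [(f"{layer}.{name}", typ)
            for layer, cols in schema.items()
            for name, typ in cols.items()]
```

Layering keeps stable identifiers and hot numeric columns separate from sparse, evolving debug fields, which suits Iceberg's schema evolution.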
Efficient Data Migration and Ingestion
Developed a custom "transcoder" tool to convert existing text logs into optimized Parquet format
Maintained compatibility with existing log agents
Gradually rolled out changes with a focus on safety and reliability
Planned for a future where log agents would write directly to aggregators, which would then ingest into Iceberg
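The transcoding step can be sketched as parsing delimited text lines into typed, column-oriented records ready for Parquet encoding. The space-delimited format and field names below are assumptions, not S3's real log layout.

```python
# Minimal "transcoder" sketch: raw text log lines -> columnar dict-of-lists.
# The log format and fields are invented for illustration.
import csv
import io

FIELDS = [("request_id", str), ("timestamp", str),
          ("operation", str), ("latency_ms", int)]

def transcode(text_log: str):
    """Parse space-delimited log lines into typed columns."""
    columns = {name: [] for name, _ in FIELDS}
    for row in csv.reader(io.StringIO(text_log), delimiter=" "):
        if len(row) != len(FIELDS):
            continue  # skip malformed lines instead of failing the batch
        for (name, cast), value in zip(FIELDS, row):
            columns[name].append(cast(value))
    return columns
```

A real transcoder would hand the resulting columns to a Parquet writer (e.g. pyarrow) rather than keep them in memory.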
Key Takeaways and Lessons Learned
Identify high-value but high-barrier internal data sets and focus on improving accessibility
Work backwards from user needs and questions, not just current capabilities
Meet systems and users where they are, minimizing disruption to existing workflows
Leverage technologies like Iceberg to provide a modern, scalable data foundation
Invest in efficient data migration and ingestion strategies to enable the transition
Business Impact
Returned thousands of engineering hours to building and operating core services
Enabled engineers to run arbitrary queries on fresh data within minutes
Empowered product managers and data scientists to access and analyze historical data much faster
Unlocked valuable insights and business intelligence previously buried in the data
Technical Details and Metrics
S3 receives over 1 million requests per second, generating terabytes to pebibytes of log data per hour
Custom "transcoder" tool converted 1 hour of logs to Parquet in just 3 minutes
Iceberg-based data layout allowed queries to skip irrelevant data, delivering results in minutes instead of hours
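The data-skipping behavior described above relies on partition metadata: the query planner prunes files whose partition values fall outside the queried range before reading any data. A toy version, with invented file paths and partition values:

```python
# Toy metadata-based file pruning. Paths and hour partitions are invented;
# Iceberg tracks equivalent metadata in its manifest files.
FILES = [
    {"path": "logs/hour=2025-06-01-00/part-0.parquet", "hour": 0},
    {"path": "logs/hour=2025-06-01-01/part-0.parquet", "hour": 1},
    {"path": "logs/hour=2025-06-01-02/part-0.parquet", "hour": 2},
]

def prune(files, hour_min, hour_max):
    """Return only the files whose partition falls inside the queried range."""
    return [f["path"] for f in files if hour_min <= f["hour"] <= hour_max]
```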
Examples and Use Cases
Analyzing trends in feature usage by customer segment over 3 months
Identifying requests that hit the primary cache but took longer than 50ms to return
Tracking performance and failure rates of specific hardware components over time
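The second use case above (cache hits slower than 50 ms) reduces to a simple filter once the logs are structured; the record fields here are illustrative assumptions, not S3's schema.

```python
# Hedged sketch of the "slow cache hit" use case: requests that hit the
# primary cache but still took longer than 50 ms. Field names are assumed.
def slow_cache_hits(records, threshold_ms=50):
    return [r["request_id"] for r in records
            if r["cache_hit"] and r["latency_ms"] > threshold_ms]
```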