AWS re:Invent 2025 - Inside S3: Lessons from exabyte-scale data lake modernization (STG351)

Modernizing S3's Exabyte-Scale Data Lake

Challenges of Massive Data Volumes

  • S3 handles over 1 million requests per second, generating massive volumes of internal log data
  • This data represents critical customer interactions and business insights, but is difficult to access and analyze at scale
  • Key challenges include:
    • Engineers spending more time finding data than analyzing it
    • Slow issue resolution due to difficulty accessing relevant logs
    • Valuable insights being buried in the "Mount Everest" of data

Improving Data Discoverability and Accessibility

  • Developed a centralized data catalog to make internal data sets more discoverable
    • Focused on ensuring data entries are useful and users are incentivized to maintain quality
    • Positioned the team as brokers of tooling rather than centralized data owners
  • Optimized the query engine with techniques like predicate and projection push-down
    • Scanned only the columns and rows a query actually needs, instead of reading full records
    • Dramatically improved query performance and reduced resource consumption

Transforming the Data Layout with Apache Iceberg

  • Recognized limitations of text-based log formats and need for a more modern data platform
  • Chose Apache Iceberg to provide:
    • Transactional updates and data consistency
    • Schema and partition evolution
    • Time travel capabilities
  • Designed a three-layer schema:
    1. Identity layer: Core identifiers like request ID, timestamps, etc.
    2. Measurements and counters: Numerical data about request behavior and performance
    3. Context layer: Additional debug and service-specific information

Efficient Data Migration and Ingestion

  • Developed a custom "transcoder" tool to convert existing text logs into optimized Parquet format
    • Maintained compatibility with existing log agents
    • Gradually rolled out changes with a focus on safety and reliability
  • Planned for a future where log agents write directly to aggregators, which then ingest into Iceberg

Key Takeaways and Lessons Learned

  • Identify high-value but high-barrier internal data sets and focus on improving accessibility
  • Work backwards from user needs and questions, not just current capabilities
  • Meet systems and users where they are, minimizing disruption to existing workflows
  • Leverage technologies like Iceberg to provide a modern, scalable data foundation
  • Invest in efficient data migration and ingestion strategies to enable the transition

Business Impact

  • Returned thousands of engineering hours to building and operating core services
  • Enabled engineers to run arbitrary queries on fresh data within minutes
  • Empowered product managers and data scientists to access and analyze historical data much faster
  • Unlocked valuable insights and business intelligence previously buried in the data

Technical Details and Metrics

  • S3 receives over 1 million requests per second, generating terabytes to pebibytes of log data per hour
  • Custom "transcoder" tool compressed 1 hour of logs to Parquet in just 3 minutes
  • Iceberg-based data layout allowed queries to skip irrelevant data, delivering results in minutes instead of hours

Examples and Use Cases

  • Analyzing trends in feature usage by customer segment over 3 months
  • Identifying requests that hit the primary cache but took longer than 50ms to return
  • Tracking performance and failure rates of specific hardware components over time
