How Ford unlocked real-time insights using Apache Iceberg on AWS (AUT311)

Connected Vehicles and Modernizing Data Lakes with Apache Iceberg on AWS

Connected Vehicles: Overview

  • Connected vehicles are enabling new remote operations and enhancing customer experience through real-time data insights and operational efficiency.
  • Key connected vehicle use cases include:
    • Predictive maintenance
    • Proactive maintenance
    • Vehicle health monitoring
    • Vehicle event tracking

Challenges in Connected Vehicle Data Management

  • Data volume growth as more vehicles get connected
  • Need for real-time data consumption for remote functions
  • Data lake scalability to handle high volume and concurrency without impacting end-users

Modernizing Data Lakes with Apache Iceberg

Evolution of Modern Data Lakes

  • Transition from traditional relational databases and data warehouses to open table formats like Apache Hudi, Apache Iceberg, and Delta Lake.
  • These formats combine the reliability practices of traditional data warehouses (transactions, schema enforcement) with the open ecosystem of big data analytics.

Advantages of Apache Iceberg

  • ACID compliance for data reliability and consistency
  • Schema enforcement and evolution to handle changing data structures
  • Scalability and performance to support petabyte-scale data and high write throughput
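
The ACID point above can be made concrete with a toy model. Iceberg commits a write by atomically swapping the table's current metadata pointer, using optimistic concurrency: a writer that started from a stale snapshot must refresh and retry. The sketch below is an in-memory illustration only; `ToyTable`, `Snapshot`, `CommitConflict`, and the file names are hypothetical and are not Iceberg's API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    files: tuple  # data files visible in this snapshot (immutable)


class CommitConflict(Exception):
    """Raised when another writer committed first."""


@dataclass
class ToyTable:
    # Snapshot history; the last entry is the current table state.
    snapshots: list = field(default_factory=lambda: [Snapshot(0, ())])

    def current(self) -> Snapshot:
        return self.snapshots[-1]

    def commit_append(self, base: Snapshot, new_files: list) -> Snapshot:
        # Compare-and-swap: succeed only if nobody committed since `base`.
        if base.snapshot_id != self.current().snapshot_id:
            raise CommitConflict(f"table changed since snapshot {base.snapshot_id}")
        snap = Snapshot(base.snapshot_id + 1, base.files + tuple(new_files))
        self.snapshots.append(snap)
        return snap


table = ToyTable()
reader_view = table.current()      # a reader pins snapshot 0
writer_a_base = table.current()    # two writers start from the same state
writer_b_base = table.current()

table.commit_append(writer_a_base, ["events-0001.parquet"])
try:
    table.commit_append(writer_b_base, ["events-0002.parquet"])
    conflict = False
except CommitConflict:
    # Writer B lost the race; it refreshes its base and retries once.
    conflict = True
    table.commit_append(table.current(), ["events-0002.parquet"])
```

The reader that pinned snapshot 0 still sees an empty, consistent table while both writers commit, which is the isolation property that matters when ingestion jobs and analysts hit the same table concurrently.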

Iceberg Benefits for Connected Vehicle Platforms

  • Reliable and consistent data ingestion, even during system failures
  • Seamless schema changes to accommodate new vehicle telemetry and features
  • Scalable and performant data lake to handle massive data volumes and concurrency
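
Schema changes are cheap in Iceberg because columns are tracked by a stable field ID rather than by name or position, so adding or renaming a column is a metadata change and existing data files are not rewritten. The following sketch mimics that idea with plain dictionaries; the column names (`vin`, `odometer_km`, `battery_soc_pct`) and values are made-up telemetry fields, not Ford's schema.

```python
# Schema: field ID -> column name. IDs are stable; names may change.
schema_v1 = {1: "vin", 2: "event_time", 3: "odometer_km"}

# Rows in data files are keyed by field ID, not by column name.
old_file_rows = [{1: "VIN123", 2: "2024-01-01T00:00:00Z", 3: 42000}]

# Evolve the schema: rename field 3 and add a new field 4.
schema_v2 = {1: "vin", 2: "event_time", 3: "odometer", 4: "battery_soc_pct"}

# Files written after the change carry the new field.
new_file_rows = [{1: "VIN456", 2: "2024-06-01T00:00:00Z", 3: 100, 4: 87}]


def read_rows(rows, schema):
    """Project stored rows onto a schema; missing fields read as None."""
    return [{name: row.get(fid) for fid, name in schema.items()} for row in rows]


old_as_v2 = read_rows(old_file_rows, schema_v2)
new_as_v2 = read_rows(new_file_rows, schema_v2)
```

Here the old file is readable under the new schema without being touched: the renamed column resolves through field ID 3, and the added column reads as `None`.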

Ford's Journey with Connected Vehicle Event Store using Apache Iceberg

Ford's Connected Vehicle Platform Overview

  • Manages over 20 million vehicles globally, enabling bidirectional data exchange and remote vehicle operations.

Building the Event Store Platform

  1. Initial scope: High-cardinality data with moderate freshness requirements.
  2. Rapid growth and scalability challenges:
    • Increased job processing times
    • Degraded query performance
    • Rising storage and compute costs
    • Data aging issues

Optimizing the Platform with Apache Iceberg

  1. Clean Zone Improvements:

    • Leveraging Glue's lazy loading to optimize file listing
    • Replacing custom UDFs with Spark native functions
    • Moving to a "bucket of files" approach
  2. Migrating to Apache Iceberg:

    • Creating a view to seamlessly transition to Iceberg tables
    • Leveraging Iceberg's table compaction to address small file issues
    • Achieving 80% improvement in query performance
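
To illustrate how compaction addresses the small-file problem, here is a minimal sketch of bin-packing: small files are grouped so that each group can be rewritten as one file close to a target size. This is a simplified first-fit-decreasing planner written for illustration, not Iceberg's actual implementation, and the 512 MB target and file sizes are assumed values.

```python
def plan_compaction(file_sizes_mb, target_mb=512):
    """Group files smaller than target_mb into bins of at most target_mb.

    Uses first-fit decreasing. Returns only groups with more than one
    file, since rewriting a single file gains nothing.
    """
    small = sorted((s for s in file_sizes_mb if s < target_mb), reverse=True)
    bins = []
    for size in small:
        for b in bins:
            if sum(b) + size <= target_mb:
                b.append(size)
                break
        else:
            # No existing bin has room; start a new one.
            bins.append([size])
    return [b for b in bins if len(b) > 1]


sizes = [8, 16, 600, 4, 256, 200, 32, 512]  # hypothetical file sizes in MB
groups = plan_compaction(sizes)
files_before = len(sizes)
files_after = files_before - sum(len(g) for g in groups) + len(groups)
```

With these sizes, five small files collapse into a single 512 MB file, so a query planner has four files to open instead of eight. Fewer, larger files mean less listing and per-file overhead, which is where query speedups from compaction typically come from.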

Future Enhancements: Streaming with Iceberg

  • Adopting a streaming approach to provide real-time access to critical vehicle data
  • Maintaining a backup raw data layer using the streaming approach

Conclusion: The Transformative Power of Apache Iceberg on AWS

  • Iceberg's features, such as ACID compliance, schema management, and scalability, are key to modernizing connected vehicle data lakes.
  • AWS supports Apache Iceberg across its analytics services, including Amazon EMR, AWS Glue, Amazon Athena, Amazon SageMaker, and Amazon Redshift.
  • Integration with the AWS Glue Data Catalog and Amazon S3 storage provides a cost-effective, scalable, and optimized transactional data lake.
