How Ford unlocked real-time insights using Apache Iceberg on AWS (AUT311)
Connected Vehicles and Modernizing Data Lakes with Apache Iceberg on AWS
Connected Vehicles: Overview
Connected vehicles are enabling new remote operations and enhancing customer experience through real-time data insights and operational efficiency.
Key connected vehicle use cases include:
Predictive maintenance
Proactive maintenance
Vehicle health monitoring
Vehicle event tracking
Challenges in Connected Vehicle Data Management
Data volume growth as more vehicles get connected
Need for real-time data consumption for remote functions
Data lake scalability to handle high volume and concurrency without impacting end-users
Modernizing Data Lakes with Apache Iceberg
Evolution of Modern Data Lakes
Transition from traditional relational databases and data warehouses to open table formats like Apache Hudi, Apache Iceberg, and Delta Lake.
These formats combine the reliability features of traditional data warehouses, such as transactions and schema management, with the open ecosystem of big data analytics.
Advantages of Apache Iceberg
ACID compliance for data reliability and consistency
Schema enforcement and evolution to handle changing data structures
Scalability and performance to support petabyte-scale data and high write throughput
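To make the ACID point concrete, the sketch below illustrates optimistic concurrency, the general idea behind atomic table commits: a writer builds a new snapshot from the one it read, and the commit only succeeds if the table's current pointer has not moved in the meantime. This is a toy in-memory model, not Iceberg's API; `ToyCatalog`, `Snapshot`, and `commit_append` are hypothetical names introduced for illustration.

```python
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Immutable list of data files that make up one table version."""
    snapshot_id: int
    files: tuple


class ToyCatalog:
    """Hypothetical stand-in for a catalog holding the current-snapshot pointer."""

    def __init__(self):
        self._lock = threading.Lock()
        self._current = Snapshot(0, ())

    def current(self):
        return self._current

    def compare_and_swap(self, expected, new):
        # The swap succeeds only if nobody committed since `expected` was read.
        with self._lock:
            if self._current is not expected:
                return False
            self._current = new
            return True


def commit_append(catalog, new_files, max_retries=5):
    """Append files, rebuilding on the latest snapshot until the swap wins."""
    for _ in range(max_retries):
        base = catalog.current()
        candidate = Snapshot(base.snapshot_id + 1, base.files + tuple(new_files))
        if catalog.compare_and_swap(base, candidate):
            return candidate
    raise RuntimeError("commit failed after retries")
```

Readers always see either the old snapshot or the new one, never a half-written state, which is why a failed ingestion job does not leave the table inconsistent.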
Iceberg Benefits for Connected Vehicle Platforms
Reliable and consistent data ingestion, even during system failures
Seamless schema changes to accommodate new vehicle telemetry and features
Scalable and performant data lake to handle massive data volumes and concurrency
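One way to picture seamless schema changes: Iceberg tracks columns by a stable field ID rather than by name or position, so adding or renaming a column does not require rewriting existing data files. The sketch below is a simplified model of that idea under stated assumptions. The column names (`vin`, `speed_kph`, `battery_pct`) and helper functions are hypothetical, and dropping columns is not handled.

```python
def add_column(schema, name):
    """Return a new schema with `name` assigned the next unused field ID."""
    next_id = max(schema.values(), default=0) + 1
    return {**schema, name: next_id}


def rename_column(schema, old, new):
    """Return a new schema where `old` is called `new` but keeps its field ID."""
    return {(new if name == old else name): fid for name, fid in schema.items()}


def read_row(stored, schema):
    """Project a stored row (keyed by field ID) through the current schema.

    Columns added after the row was written are read back as None.
    """
    return {name: stored.get(fid) for name, fid in schema.items()}


# A row written under the original schema, stored by field ID.
schema = {"vin": 1, "speed_kph": 2}
old_row = {1: "VIN123", 2: 88}

# Evolve the schema: add new telemetry and rename an existing column.
schema = add_column(schema, "battery_pct")
schema = rename_column(schema, "speed_kph", "speed")

row = read_row(old_row, schema)
```

Here the old row is still readable after both changes: the renamed column resolves through its unchanged ID, and the new column is simply empty for data written before it existed.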
Ford's Journey with Connected Vehicle Event Store using Apache Iceberg
Ford's Connected Vehicle Platform Overview
Manages over 20 million vehicles globally, enabling bidirectional data exchange and remote vehicle operations.
Building the Event Store Platform
Initial scope: High-cardinality data with moderate freshness requirements.
Rapid growth and scalability challenges:
Increased job processing times
Degraded query performance
Rising storage and compute costs
Data aging issues
Optimizing the Platform with Apache Iceberg
Clean Zone Improvements:
Leveraging Glue's lazy loading to optimize file listing
Replacing custom UDFs with Spark native functions
Moving to a "bucket of files" approach
Migrating to Apache Iceberg:
Creating a view to seamlessly transition to Iceberg tables
Leveraging Iceberg's table compaction to address small file issues
Achieving an 80% improvement in query performance
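The small file problem and its fix can be sketched as a bin-packing exercise: many small files are grouped into rewrite batches that each approach a target size, so queries open fewer, larger files. The function below is a minimal greedy illustration, not Iceberg's compaction implementation; `plan_compaction` and the `target_mb` value are assumptions chosen for the example.

```python
def plan_compaction(file_sizes_mb, target_mb=512):
    """Greedily group small files into rewrite groups of at most target_mb.

    Files already at or above the target are left alone, and a group is only
    worth rewriting if it merges at least two files.
    """
    groups, current, current_size = [], [], 0
    for size in sorted(file_sizes_mb, reverse=True):
        if size >= target_mb:
            continue  # already large enough; leave untouched
        if current and current_size + size > target_mb:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return [group for group in groups if len(group) > 1]


plan = plan_compaction([600, 300, 200, 100, 50, 10], target_mb=512)
# plan == [[300, 200], [100, 50, 10]]: five small files become two larger ones
```

In this example the 600 MB file is skipped, and the remaining five files are merged into two outputs of 500 MB and 160 MB.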
Future Enhancements: Streaming with Iceberg
Adopting a streaming approach to provide real-time access to critical vehicle data
Maintaining a backup raw data layer using the streaming approach
Conclusion: The Transformative Power of Apache Iceberg on AWS
Iceberg's features, such as ACID compliance, schema management, and scalability, are key to modernizing connected vehicle data lakes.
AWS supports Apache Iceberg across its analytics services, including Amazon EMR, AWS Glue, Amazon Athena, Amazon SageMaker, and Amazon Redshift.
Integration with the AWS Glue Data Catalog and Amazon S3 storage supports a cost-effective, scalable, and optimized transactional data lake.