ML Infrastructure at Zuk: Powering Autonomous Driving for Robot Taxis
Introduction
- Zuk was founded in 2014 to make personal transportation safer, cleaner, and more enjoyable.
- Zuk is building a fully autonomous, all-electric robot taxi designed specifically for riders, not drivers.
- Zuk's mission is to reinvent personal transportation and reduce the high fatality rates, inefficiencies, and pollution caused by human-driven cars.
Key Machine Learning Use Cases at Zuk
- Autonomous Driving:
- Perception: Detecting and classifying objects (pedestrians, vehicles, traffic signals, etc.) using sensors like cameras, LiDAR, and radar.
- Prediction: Forecasting the future behavior of detected objects to plan safe navigation.
- Planning: Determining the best route and actions (acceleration, braking, lane changes) to safely reach the destination.
- Collision Avoidance: Redundant end-to-end system to predict and avoid potential collisions.
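The perception → prediction → planning flow above can be sketched in miniature. Everything here, the class names, the constant-velocity forecast, the corridor check, is an illustrative stand-in for the ideas described, not Zuk's actual stack or APIs:

```python
from dataclasses import dataclass

@dataclass
class TrackedObject:
    label: str                     # e.g. "pedestrian", "vehicle"
    position: tuple                # (x, y) in the ego frame, meters
    velocity: tuple                # (vx, vy) in m/s

def perceive(sensor_frame) -> list:
    """Fuse camera/LiDAR/radar detections into tracked objects (stubbed)."""
    return [TrackedObject("pedestrian", (12.0, 0.5), (0.0, 0.5))]

def predict(obj: TrackedObject, horizon_s: float = 1.0) -> tuple:
    """Constant-velocity forecast of the object's future position."""
    return (obj.position[0] + obj.velocity[0] * horizon_s,
            obj.position[1] + obj.velocity[1] * horizon_s)

def plan(ego_speed: float, objects: list) -> str:
    """Brake if any forecast lands inside the ego vehicle's corridor."""
    for obj in objects:
        x, y = predict(obj)
        if 0.0 < x < ego_speed * 2.0 and abs(y) < 2.0:
            return "brake"
    return "maintain_speed"

objects = perceive(sensor_frame=None)
action = plan(ego_speed=10.0, objects=objects)  # → "brake"
```

Real systems replace each stub with learned models, but the staged structure — detect, forecast, then act — is the same.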
- Other ML Use Cases:
- Generative AI: Creating diverse scenarios for simulation-based validation of the AI system.
- Foundational Models: Building models for task-agnostic scene understanding, bug triaging, and knowledge mining.
ML Infrastructure at Zuk
The goal of Zuk's ML Infrastructure team is to reduce the end-to-end time for developing and deploying machine learning models. The infrastructure consists of four key components:
- Data Infrastructure:
- Challenges: Data management, discoverability, availability, and governance.
- Solution: Medallion architecture on Amazon S3, using Delta tables and Apache Spark.
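The medallion pattern refines data through bronze (raw), silver (cleaned), and gold (curated) layers. A minimal sketch of how such a layout might be organized on S3 follows; the bucket name, prefixes, and helper functions are hypothetical, not Zuk's actual layout:

```python
# Medallion layers, in order of refinement: raw -> cleaned -> curated.
LAYERS = ("bronze", "silver", "gold")

def table_path(layer: str, domain: str, table: str,
               bucket: str = "s3://example-datalake") -> str:
    """Build the S3 prefix where a Delta table lives for a given layer."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer}")
    return f"{bucket}/{layer}/{domain}/{table}"

def next_layer(layer: str) -> str:
    """Return the layer a table is promoted to after the next refinement job."""
    idx = LAYERS.index(layer)
    if idx == len(LAYERS) - 1:
        raise ValueError("gold is the final layer")
    return LAYERS[idx + 1]

raw = table_path("bronze", "perception", "camera_frames")
```

In practice Spark jobs read a Delta table at one layer, apply cleaning or aggregation, and write the result to the next layer's prefix.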
- Training Infrastructure:
- Key features: OSS frameworks (PyTorch, JAX), optimized data loaders, a model repository, and experiment tracking.
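"Optimized data loaders" typically means keeping accelerators fed by preparing the next batches while the current one trains. A minimal, stdlib-only sketch of that prefetching idea is below; it is an illustration of the technique, not Zuk's loader or the PyTorch/JAX implementations:

```python
import queue
import threading

class PrefetchLoader:
    """Wrap an iterable of batches and prefetch them on a background thread,
    so training steps are not stalled waiting on data I/O."""

    def __init__(self, batches, buffer_size: int = 4):
        self._queue = queue.Queue(maxsize=buffer_size)
        self._sentinel = object()
        self._thread = threading.Thread(
            target=self._produce, args=(batches,), daemon=True)
        self._thread.start()

    def _produce(self, batches):
        for batch in batches:
            self._queue.put(batch)       # blocks when the buffer is full
        self._queue.put(self._sentinel)  # signal end of data

    def __iter__(self):
        while True:
            item = self._queue.get()
            if item is self._sentinel:
                return
            yield item

loader = PrefetchLoader(batches=[[1, 2], [3, 4], [5, 6]])
seen = list(loader)  # batches arrive in order: [[1, 2], [3, 4], [5, 6]]
```

Production loaders add multi-process decoding, pinned memory, and sharding, but the overlap of I/O with compute is the core win.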
- Serving Infrastructure:
- On-vehicle inference optimization using NVIDIA TensorRT.
- Cloud-based high-throughput batch inference on Amazon EKS with Ray Serve.
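High-throughput batch serving hinges on dynamic batching: grouping many requests into one model call to amortize per-call overhead. The pure-Python sketch below illustrates that idea in the abstract; it is a stand-in for what frameworks like Ray Serve provide, not the Ray Serve API itself:

```python
def batch_requests(requests, max_batch_size: int):
    """Group incoming requests into batches, one model call per batch."""
    for i in range(0, len(requests), max_batch_size):
        yield requests[i:i + max_batch_size]

def run_inference(model, requests, max_batch_size: int = 8):
    """Run a batched model over all requests and flatten the outputs."""
    outputs = []
    for batch in batch_requests(requests, max_batch_size):
        outputs.extend(model(batch))  # one forward pass per batch
    return outputs

# Toy "model": processes a whole batch in a single vectorized call.
double = lambda xs: [2 * x for x in xs]
outputs = run_inference(double, list(range(10)), max_batch_size=4)
```

A real serving layer batches across concurrent clients under a latency budget rather than over a pre-collected list, but the throughput mechanism is the same.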
- Compute and Storage Infrastructure:
- Compute: Amazon EC2, Amazon EKS, and Slurm for job scheduling.
- Storage: Amazon S3 and Amazon FSx for Lustre.
- Workflow orchestration with Apache Airflow.
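At its core, workflow orchestration means running dependent tasks in a valid order — the scheduling problem a tool like Airflow solves for a DAG of tasks. A stdlib sketch using `graphlib` is below; the task names are hypothetical examples, not Zuk's actual pipelines:

```python
from graphlib import TopologicalSorter

# Hypothetical ML pipeline DAG: each task maps to the set of upstream
# tasks it depends on (the shape Airflow schedules in production).
dag = {
    "extract_logs":   set(),
    "build_dataset":  {"extract_logs"},
    "train_model":    {"build_dataset"},
    "run_simulation": {"train_model"},
    "publish_report": {"run_simulation", "train_model"},
}

# A valid execution order: every task appears after all its dependencies.
order = list(TopologicalSorter(dag).static_order())
```

Airflow adds scheduling, retries, and backfills on top, but a run always reduces to executing such a topological order (with independent tasks in parallel).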
Unique Aspects of ML Infrastructure for an AV Company
- Simulation-heavy Validation: Extensive use of state-of-the-art simulation to validate the AI stack before real-world deployment.
- GPU-hungry Workloads: Spiky GPU demand for training and simulation, requiring cost-efficient GPU procurement strategies.
- Strict Inference Latency Requirements: Models must run multiple times per second for a seamless autonomous driving experience, necessitating optimizations like quantization and pruning.
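To make the quantization point concrete, here is a back-of-the-envelope sketch of symmetric int8 post-training quantization — the family of optimization mentioned above. It is a toy illustration of the arithmetic, not what TensorRT or PyTorch actually do internally:

```python
def quantize(weights, num_bits: int = 8):
    """Map float weights to signed integers with one shared scale factor."""
    qmax = 2 ** (num_bits - 1) - 1              # 127 for int8
    scale = max(abs(w) for w in weights) / qmax
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the integer representation."""
    return [x * scale for x in q]

w = [0.51, -1.27, 0.0, 0.95]
q, scale = quantize(w)          # q = [51, -127, 0, 95]
w_hat = dequantize(q, scale)    # ≈ w, at a quarter of float32's storage
```

Storing int8 instead of float32 cuts memory bandwidth ~4x, and integer math maps onto fast GPU tensor-core paths — which is how such models hit multi-hertz inference rates on-vehicle.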
How AWS Accelerates Innovation at Zuk
- ML Capacity Blocks: Guaranteed GPU capacity reservations, allowing Zuk to run experiments and fine-tune models more efficiently.
- Amazon S3 Intelligent-Tiering: Automatic data lifecycle management to control storage costs for Zuk's vast data lake.
- Amazon FSx for Lustre: High-performance file system providing low-latency data access for Zuk's training and inference workloads.
- Amazon OpenSearch Service: Powering log analytics, semantic embeddings, and nearest neighbor searches for Zuk's data mining and debugging use cases.
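A nearest-neighbor search over scene embeddings boils down to ranking vectors by similarity to a query. The toy below does this brute-force with cosine similarity; it is a conceptual stand-in for the approximate k-NN index OpenSearch Service provides, and the embeddings are made up for illustration:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def knn(query, corpus, k=2):
    """Return the ids of the k embeddings most similar to the query."""
    ranked = sorted(corpus.items(),
                    key=lambda kv: cosine(query, kv[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]

# Hypothetical 2-d scene embeddings (real ones are hundreds of dims).
scenes = {
    "rainy_night":   (0.9, 0.1),
    "sunny_highway": (0.1, 0.9),
    "foggy_bridge":  (0.8, 0.3),
}
matches = knn(query=(1.0, 0.2), corpus=scenes, k=2)
```

At data-lake scale, exact ranking is too slow, which is why services like OpenSearch use approximate indexes (e.g. HNSW) that trade a little recall for orders-of-magnitude faster lookups.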