ML infrastructure at Zoox that powers autonomous driving of robotaxis (AMZ201)

ML Infrastructure at Zoox: Powering Autonomous Driving of Robotaxis

Introduction

  • Zoox was founded in 2014 to make personal transportation safer, cleaner, and more enjoyable.
  • Zoox is building a fully autonomous, all-electric robotaxi designed specifically for riders, not drivers.
  • Zoox's mission is to reinvent personal transportation and reduce the high fatality rates, inefficiencies, and pollution caused by human-driven cars.

Key Machine Learning Use Cases at Zoox

  1. Autonomous Driving:

    • Perception: Detecting and classifying objects (pedestrians, vehicles, traffic signals, etc.) using sensors like cameras, LiDAR, and radar.
    • Prediction: Forecasting the future behavior of detected objects to plan safe navigation.
    • Planning: Determining the best route and actions (acceleration, braking, lane changes) to safely reach the destination.
    • Collision Avoidance: A redundant end-to-end system that predicts and avoids potential collisions.
  2. Other ML Use Cases:

    • Generative AI: Creating diverse scenarios for simulation-based validation of the AI system.
    • Foundation Models: Building models for task-agnostic scene understanding, bug triaging, and knowledge mining.
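
As a toy illustration of how prediction can feed a collision check, the sketch below extrapolates a detected object with a constant-velocity model and reports the first time its predicted position comes within a safety radius of the ego vehicle's planned waypoint. The `TrackedObject` type, the constant-velocity assumption, and the 2-meter radius are hypothetical choices for this example; a production stack would rely on learned models rather than this simplification.

```python
from dataclasses import dataclass
import math


@dataclass
class TrackedObject:
    """Hypothetical perception output: position (m) and velocity (m/s)."""
    x: float
    y: float
    vx: float
    vy: float


def predict_positions(obj, horizon_s, dt):
    """Constant-velocity forecast: one (x, y) per time step up to horizon_s."""
    steps = int(round(horizon_s / dt))
    return [(obj.x + obj.vx * dt * k, obj.y + obj.vy * dt * k)
            for k in range(1, steps + 1)]


def first_conflict_time(ego_path, obj, dt, safety_radius_m=2.0):
    """Return the earliest time (s) at which the predicted object is within
    safety_radius_m of the ego waypoint for the same step, or None."""
    predicted = predict_positions(obj, len(ego_path) * dt, dt)
    for k, ((ex, ey), (ox, oy)) in enumerate(zip(ego_path, predicted), start=1):
        if math.hypot(ex - ox, ey - oy) < safety_radius_m:
            return k * dt
    return None


# Ego drives straight along +x at 10 m/s; waypoints every 0.5 s for 3 s.
dt = 0.5
ego_path = [(10.0 * dt * k, 0.0) for k in range(1, 7)]
# A pedestrian starts 20 m ahead and 4 m to the side, walking toward the lane.
pedestrian = TrackedObject(x=20.0, y=4.0, vx=0.0, vy=-2.0)
conflict = first_conflict_time(ego_path, pedestrian, dt)
```

A planner could use a non-None result to trigger braking or a replanned path before the conflict time is reached.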

ML Infrastructure at Zoox

The goal of Zoox's ML Infrastructure team is to reduce the end-to-end time for developing and deploying machine learning models. The infrastructure consists of four key components:

  1. Data Infrastructure:

    • Challenges: Data management, discoverability, availability, and governance.
    • Solution: Medallion architecture on Amazon S3, using Delta tables and Apache Spark.
  2. Training Infrastructure:

    • Key features: Open-source frameworks (PyTorch, JAX), optimized data loaders, model repository, and experiment tracking.
  3. Serving Infrastructure:

    • On-vehicle inference optimization using NVIDIA TensorRT.
    • Cloud-based high-throughput batch inference on Amazon EKS with Ray Serve.
  4. Compute and Storage Infrastructure:

    • Compute: Amazon EC2, Amazon EKS, and Slurm for job scheduling.
    • Storage: Amazon S3 and Amazon FSx for Lustre.
    • Workflow orchestration with Apache Airflow.
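
The medallion pattern named under Data Infrastructure organizes a data lake into layers of increasing refinement, conventionally called bronze (raw), silver (cleaned), and gold (aggregated). The sketch below shows that flow with plain Python lists so it runs anywhere; in a setup like the one described, each layer would instead be a Delta table on Amazon S3 written by Spark jobs. The record fields and cleaning rules are hypothetical.

```python
from collections import defaultdict

# Bronze: raw records as ingested, including duplicate and incomplete rows.
bronze = [
    {"log_id": "a1", "vehicle": "v01", "label": "pedestrian", "confidence": "0.91"},
    {"log_id": "a1", "vehicle": "v01", "label": "pedestrian", "confidence": "0.91"},
    {"log_id": "a2", "vehicle": "v01", "label": "vehicle", "confidence": "0.75"},
    {"log_id": "a3", "vehicle": "v02", "label": None, "confidence": "0.40"},
    {"log_id": "a4", "vehicle": "v02", "label": "vehicle", "confidence": "0.88"},
]


def to_silver(rows):
    """Clean bronze rows: drop missing labels, deduplicate on log_id,
    and cast confidence from string to float."""
    seen, silver = set(), []
    for row in rows:
        if row["label"] is None or row["log_id"] in seen:
            continue
        seen.add(row["log_id"])
        silver.append({**row, "confidence": float(row["confidence"])})
    return silver


def to_gold(rows):
    """Aggregate silver rows into per-label counts for downstream consumers."""
    counts = defaultdict(int)
    for row in rows:
        counts[row["label"]] += 1
    return dict(counts)


silver = to_silver(bronze)
gold = to_gold(silver)
```

Keeping the raw bronze layer untouched means the silver and gold layers can be rebuilt whenever the cleaning rules change.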

Unique Aspects of ML Infrastructure for an AV Company

  1. Simulation-heavy Validation: Extensive use of state-of-the-art simulation to validate the AI stack before real-world deployment.
  2. GPU-hungry Workloads: Spiky GPU demand for training and simulation, requiring cost-efficient GPU procurement strategies.
  3. Strict Inference Latency Requirements: Models must run multiple times per second for a seamless autonomous driving experience, necessitating optimizations like quantization and pruning.
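
Quantization reduces inference cost by storing weights in a lower-precision integer format. A minimal sketch of symmetric per-tensor int8 quantization follows, written in plain Python for clarity. It illustrates the general technique only; it is not the TensorRT workflow, and the talk summary does not say which quantization scheme is used on the vehicle.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize_int8(quantized, scale):
    """Recover approximate float values from the integer codes."""
    return [q * scale for q in quantized]


weights = [0.50, -1.27, 0.03, 1.00]
codes, scale = quantize_int8(weights)
restored = dequantize_int8(codes, scale)
# The rounding error per weight is bounded by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

The trade-off is a small, bounded loss of precision in exchange for smaller weights and faster integer arithmetic on supporting hardware.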

How AWS Accelerates Innovation at Zoox

  1. Amazon EC2 Capacity Blocks for ML: Guaranteed GPU capacity reservations, allowing Zoox to run experiments and fine-tune models more efficiently.
  2. Amazon S3 Intelligent-Tiering: Automatic data lifecycle management to control storage costs for Zoox's vast data lake.
  3. Amazon FSx for Lustre: High-performance file system that provides low-latency access to data for Zoox's training and inference workloads.
  4. Amazon OpenSearch Service: Powers log analytics, semantic embeddings, and nearest-neighbor searches for Zoox's data mining and debugging use cases.
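
Nearest-neighbor search over embeddings, as mentioned in the last item, ranks stored vectors by similarity to a query vector. The brute-force cosine-similarity sketch below shows the idea in plain Python. It does not use the OpenSearch API, and the scene names and three-dimensional embeddings are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_neighbors(query, index, k=2):
    """Return the names of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query, vec), name) for name, vec in index.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]


# Hypothetical scene embeddings keyed by a descriptive name.
index = {
    "cyclist_at_night": [0.9, 0.1, 0.0],
    "cyclist_in_rain": [0.8, 0.3, 0.1],
    "parked_truck": [0.0, 0.2, 0.9],
}
query = [1.0, 0.1, 0.0]
results = nearest_neighbors(query, index, k=2)
```

Brute force compares the query against every stored vector, so a search service would typically use an approximate index to keep lookups fast over large collections.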
