Data on AWS: The key to success for 3 AI innovators (STG208)


Data Curation and Scalable Infrastructure

Importance of Data Quality

  • Customers often deal with massive amounts of raw aggregated data from various sources (e.g., sales transactions, customer interactions, IoT streams, image catalogs).
  • Teams may not need all this raw data to answer business questions or solve analytical problems.
  • The concept of "data curation" involves building and curating more specialized data sets for particular tasks, treating data as a product.
  • This allows for more focused and usable data for researchers and product managers.

Performance Optimization

  • Customers aim to saturate the available bandwidth between storage (e.g., Amazon S3) and compute to maximize performance.
  • Storage can become the bottleneck: when compute waits on data, the entire end-to-end workload stalls and compute costs rise because provisioned instances sit idle.
  • For Amazon S3, key performance metrics to optimize include request rate, throughput, and first-byte latency.
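To make the relationship between these metrics concrete, here is a rough back-of-the-envelope sketch. It assumes the commonly cited AWS guideline of about one concurrent request per 85 to 90 MB/s of desired throughput; the function names and the sample numbers are illustrative and were not given in the talk.

```python
import math

# Assumption: roughly 85 MB/s per concurrent S3 request, the conservative end
# of the 85-90 MB/s figure in AWS's S3 performance guidelines.
PER_REQUEST_MB_S = 85.0


def gbps_to_mb_per_s(gbps: float) -> float:
    """Convert network bandwidth from gigabits/s to megabytes/s."""
    return gbps * 1000.0 / 8.0


def concurrent_requests_needed(nic_gbps: float,
                               per_request_mb_s: float = PER_REQUEST_MB_S) -> int:
    """Estimate how many parallel requests are needed to fill the network link."""
    return math.ceil(gbps_to_mb_per_s(nic_gbps) / per_request_mb_s)


def effective_throughput_mb_s(object_mb: float,
                              first_byte_latency_s: float,
                              stream_mb_s: float) -> float:
    """Per-request throughput once first-byte latency is counted.

    Small objects are dominated by latency, large ones by streaming rate.
    """
    transfer_s = object_mb / stream_mb_s
    return object_mb / (first_byte_latency_s + transfer_s)


if __name__ == "__main__":
    # Hypothetical instance with a 100 Gbps network interface.
    print(concurrent_requests_needed(100))
    # Hypothetical 8 MB range read, 100 ms to first byte, 80 MB/s streaming.
    print(effective_throughput_mb_s(8, 0.1, 80))
```

The second function shows why first-byte latency matters: for small reads, the wait for the first byte can halve the throughput a single request achieves, so more parallelism is needed to keep the link full.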

Scaling Storage

  • Customers need to scale their storage to handle growing data volumes, from hundreds of objects to millions or billions.
  • Storage classes like Amazon S3 Intelligent-Tiering can automatically optimize costs by moving data between access tiers as access patterns change.
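One way to opt existing data into this behavior is an S3 Lifecycle rule that transitions objects to the `INTELLIGENT_TIERING` storage class. The sketch below only builds the configuration document; the bucket name, prefix, and rule ID are hypothetical, and the talk did not show this configuration.

```python
def intelligent_tiering_lifecycle(prefix: str, after_days: int = 0) -> dict:
    """Build a lifecycle configuration that moves objects under `prefix`
    to S3 Intelligent-Tiering `after_days` days after creation.

    The rule ID and default of 0 days are illustrative choices.
    """
    if after_days < 0:
        raise ValueError("after_days must be zero or positive")
    rule_name = prefix.strip("/").replace("/", "-") or "all-objects"
    return {
        "Rules": [
            {
                "ID": f"to-intelligent-tiering-{rule_name}",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": after_days, "StorageClass": "INTELLIGENT_TIERING"}
                ],
            }
        ]
    }


if __name__ == "__main__":
    config = intelligent_tiering_lifecycle("datasets/images/")
    print(config)
    # Applying it requires boto3 and AWS credentials (not run here):
    #   import boto3
    #   boto3.client("s3").put_bucket_lifecycle_configuration(
    #       Bucket="example-bucket",  # hypothetical bucket name
    #       LifecycleConfiguration=config,
    #   )
```

New objects can also be written directly to Intelligent-Tiering by setting the storage class at upload time, which avoids the transition step entirely.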

Canva's Approach

  • Canva is a collaborative online visual design tool with over 200 million monthly active users and 30 billion designs created.
  • Canva has a mission to bring AI to the forefront of everything they do, with over 120 ML microservices.
  • Challenges include content moderation, maintaining user trust and privacy, and curating data for various use cases.
  • Canva has built a centralized moderation platform that leverages services like Amazon Rekognition to detect and flag unsafe content.
  • Canva also has a platform to manage user privacy preferences and federate data access, ensuring compliance with regulations and respecting user consent.

Bria's Approach

  • Bria is a generative AI platform for developers, training foundation models in the visual domain from scratch.
  • Bria partners with data providers to obtain fully licensed data, then trains models and provides attribution and royalties back to the data providers.
  • Bria's key challenges include scaling data pipelines to digest petabytes of multimodal data (images and videos) and building infrastructure to train models efficiently.
  • Bria uses data pipelines with cheaper GPUs (e.g., NVIDIA A10) to preprocess and extract insights from the data, then leverages high-performance GPU clusters (e.g., NVIDIA H100) for model training.
  • Bria uses Amazon S3, AWS Glue, and Amazon Athena to build a scalable data catalog and an attribution engine that tracks and reports royalties owed to data providers.

Anthropic's Approach

  • Anthropic is an AI research and safety company, known for its AI assistant Claude.
  • Anthropic processes datasets of up to 200 petabytes every two weeks to train their large language models.
  • Anthropic uses Amazon S3 features like S3 Express One Zone, cross-bucket replication, and S3 Intelligent-Tiering to optimize performance, cost, and scalability.
  • Anthropic has designed a structured approach to storing model checkpoints in S3, using a hierarchical key structure and lifecycle management to prune older checkpoints.
  • Anthropic also leverages asynchronous I/O and the AWS Common Runtime S3 client to maximize throughput when ingesting data for training.
  • Anthropic emphasizes the importance of designing key structures and naming conventions in S3 to enable future scalability and efficient data management.
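The summary does not give Anthropic's actual key layout, so the following is only a hypothetical sketch of what a hierarchical checkpoint key scheme and a pruning policy could look like. The `checkpoints/<run>/step-<n>/shard-...` layout and all function names are assumptions, and the pruning here is a client-side selection of keys rather than an S3 Lifecycle rule.

```python
from typing import Iterable


def checkpoint_key(run_id: str, step: int, shard: int, num_shards: int) -> str:
    """Build a hierarchical key (hypothetical layout).

    Zero-padded numbers keep lexicographic order equal to numeric order,
    so a prefix listing returns checkpoints in training order.
    """
    return (f"checkpoints/{run_id}/step-{step:010d}/"
            f"shard-{shard:05d}-of-{num_shards:05d}.bin")


def step_of(key: str) -> int:
    """Extract the training step from a key produced by checkpoint_key."""
    for part in key.split("/"):
        if part.startswith("step-"):
            return int(part[len("step-"):])
    raise ValueError(f"no step component in key: {key}")


def keys_to_prune(keys: Iterable[str], keep_last: int) -> list[str]:
    """Return keys belonging to all but the newest `keep_last` checkpoints."""
    keys = list(keys)
    steps = sorted({step_of(k) for k in keys})
    keep = set(steps[-keep_last:]) if keep_last > 0 else set()
    return sorted(k for k in keys if step_of(k) not in keep)


if __name__ == "__main__":
    all_keys = [checkpoint_key("run-a", step, shard, 2)
                for step in (100, 200, 300) for shard in range(2)]
    for key in keys_to_prune(all_keys, keep_last=2):
        print(key)
```

Because every checkpoint lives under its own `step-` prefix, a whole checkpoint can be listed, replicated, or expired as a unit, which is the kind of future flexibility that deliberate key design is meant to preserve.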
