AWS re:Invent 2025 - Building scalable applications with text and multimodal understanding (AIM375)

Building Scalable Applications with Text and Multimodal Understanding

Introduction

  • Presented by DH Rajput, Principal Product Manager at Amazon AGI (Artificial General Intelligence)
  • Discussed how to use data beyond text (images, documents, videos, and audio) to build accurate, context-aware applications with Amazon Nova foundation models.
  • Joined by Brandon Nyer, Senior Product Manager, and Tyianne of Box, who covered image and video understanding and customer use cases.

Enterprise Needs and Challenges

  • Organizations have vast amounts of multimodal data (text, structured data, contracts, videos, call recordings) but only use a small portion, mostly text or structured data.
  • Key challenges with using multimodal data:
    1. Separate models and tools for each modality, leading to complexity and lack of context integration.
    2. Difficulty in reasoning across modalities to deliver customer insights.
    3. Inaccurate models that require human review, which doesn't scale.

Amazon Nova 2.0 Models

  • Designed to treat all modalities as first-class citizens, with native multimodal processing capabilities.
  • Variety of models to cater to different cost, latency, and accuracy profiles:
    • Nova 2 Light: Fast, cost-effective reasoning model
    • Nova 2 Pro: Most intelligent model for complex tasks
    • Nova 2 Omni: Unified model for understanding and generation
    • Nova 2 Sonic: Conversational speech-to-text model
    • Nova Multimodal Embeddings: Cross-modal search and retrieval
  • Key features:
    • 1-million-token context window for processing long-form content
    • Multilingual support (200+ languages, 10+ for speech)
    • Integrated reasoning capabilities
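As a concrete illustration of native multimodal input, the sketch below builds a Bedrock Converse-API request that mixes text and an image in a single message. The model ID is a hypothetical placeholder (check the Bedrock console for real identifiers), and the image bytes are a stand-in; the actual network call is shown commented out since it needs AWS credentials.

```python
# Sketch: mixed text + image input for a Nova model through the Amazon
# Bedrock Converse API. MODEL_ID below is an assumption for illustration,
# not a confirmed identifier.

MODEL_ID = "us.amazon.nova-2-lite-v1:0"  # hypothetical model ID

def build_multimodal_message(prompt: str, image_bytes: bytes) -> dict:
    """Build a Converse-API user message containing text and one image."""
    return {
        "role": "user",
        "content": [
            {"text": prompt},
            {"image": {"format": "png", "source": {"bytes": image_bytes}}},
        ],
    }

message = build_multimodal_message(
    "List the objects visible in this living-room photo.",
    image_bytes=b"<png bytes here>",  # placeholder, not a real image
)

# With credentials configured, the call would look like:
# import boto3
# client = boto3.client("bedrock-runtime")
# response = client.converse(modelId=MODEL_ID, messages=[message])
# print(response["output"]["message"]["content"][0]["text"])
```

The same message shape extends to documents and video by swapping the content block type, which is what lets one model treat all modalities uniformly.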

Document Intelligence

  • Optimized for two key primitives: Optical Character Recognition (OCR) and Key Information Extraction (KIE).
  • OCR optimizations:
    1. Robust real-world OCR for challenging documents (handwritten, low-quality scans, tilted)
    2. Mixed context understanding (text, charts, tables)
    3. Structured output (JSON, HTML, XML)
  • KIE optimizations:
    1. Schema-driven extraction
    2. Layout-aware text extraction
    3. Integrated reasoning to validate extracted data
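Schema-driven extraction can be sketched as follows: hand the model a JSON schema describing the fields you want, so the response is structured JSON rather than free text. The invoice schema and prompt wording here are illustrative assumptions, not an official Nova prompt format.

```python
import json

# Sketch: schema-driven key information extraction (KIE). The schema and
# prompt below are illustrative; the real document bytes would accompany
# this prompt in the model request.

INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "issue_date": {"type": "string", "description": "ISO 8601 date"},
        "total_amount": {"type": "number"},
        "currency": {"type": "string"},
    },
    "required": ["invoice_number", "total_amount"],
}

def build_kie_prompt(schema: dict) -> str:
    """Ask the model to return only JSON conforming to the given schema."""
    return (
        "Extract the following fields from the attached document and "
        "return ONLY valid JSON conforming to this schema:\n"
        + json.dumps(schema, indent=2)
    )

prompt = build_kie_prompt(INVOICE_SCHEMA)
```

Validating the returned JSON against the same schema on the client side pairs naturally with the model's integrated reasoning checks.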

Image and Video Understanding

  • Vision Perception: Identifying objects, attributes, and spatial relationships in images.
    • Example: Detecting plants, cushions, table, and TV in a living room image.
  • Reasoning and Scene Semantics: Leveraging reasoning capabilities to understand context and make logical deductions.
  • Temporal Understanding: Analyzing video content to identify events, generate captions, and extract timestamps.
    • Example: Identifying timestamps when someone is standing on a boat in a video.
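The boat-timestamp example above might be expressed as a Converse-API request with a video content block, sketched below. The model ID is hypothetical and the video bytes are a placeholder; the request shape follows the Converse API's documented content-block format.

```python
# Sketch: asking a Nova model for event timestamps in a video via the
# Bedrock Converse API. MODEL_ID is a hypothetical placeholder.

MODEL_ID = "us.amazon.nova-2-pro-v1:0"  # hypothetical model ID

def build_video_query(question: str, video_bytes: bytes) -> dict:
    """Build a Converse-API user message containing text and one video."""
    return {
        "role": "user",
        "content": [
            {"text": question},
            {"video": {"format": "mp4", "source": {"bytes": video_bytes}}},
        ],
    }

msg = build_video_query(
    "Return every timestamp (mm:ss) where a person is standing on the boat.",
    video_bytes=b"<mp4 bytes>",  # placeholder, not a real video
)

# client.converse(modelId=MODEL_ID, messages=[msg]) would return the
# model's text answer containing the requested timestamps.
```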

Amazon Nova Multimodal Embeddings

  • Unified model that generates embeddings for text, images, documents, videos, and audio.
  • Represents all modalities in the same embedding space, enabling cross-modal applications.
  • Key features:
    • Unmatched modality coverage
    • Long context length (8,000 tokens)
    • Segmentation capabilities
    • Synchronous and asynchronous APIs
    • Choice of embedding dimensions for cost-accuracy tradeoff
  • Benchmarks show strong performance on video retrieval, visual document search, and text-based tasks.
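Because all modalities land in one embedding space, a text query vector can be compared directly against vectors from videos, documents, or images. The toy sketch below shows that retrieval loop with tiny mock vectors; real embeddings would come from the Nova Multimodal Embeddings API and have far higher dimensionality.

```python
import math

# Sketch: cross-modal retrieval on a shared embedding space. The vectors
# are mock 3-dimensional stand-ins for API-generated embeddings.

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# A toy index mixing modalities: video, document, and image embeddings.
index = {
    "boat_clip.mp4": [0.9, 0.1, 0.2],
    "contract.pdf": [0.1, 0.8, 0.3],
    "sunset.png": [0.2, 0.2, 0.9],
}

# Mock embedding of the text query "person on a boat".
query_vec = [0.85, 0.15, 0.25]

best = max(index, key=lambda k: cosine_similarity(query_vec, index[k]))
print(best)  # prints "boat_clip.mp4": the video is nearest to the query
```

In production this nearest-neighbor step would run in a vector database; the point is that one query embedding searches every modality at once.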

Box Use Cases

  • Box is the leading intelligent content management company, serving over 115,000 organizations.
  • Challenges in accessing unstructured data (PDFs, documents, videos, etc.) and extracting insights.
  • Previous approaches:
    1. Text extraction and embedding
    2. Human annotation
  • Limitations: Lack of context, scalability issues, and slow processing.
  • Amazon Nova Multimodal Embeddings unlocks new use cases:
    1. Generating instant insights from long, technical documents
    2. Automating workflows by extracting actionable information
    3. Enabling continuity checks for media production by searching video content

Conclusion and Next Steps

  • Box is actively testing and integrating the new Amazon Nova 2.0 models.
  • Exploring new use cases enabled by the multimodal understanding capabilities.
  • Focused on scaling the deployment of these models in production environments to benefit customers.
