AWS re:Invent 2025 - Building scalable applications with text and multimodal understanding (AIM375)

Building Scalable Applications with Text and Multimodal Understanding

Introduction

  • Presented by DH Rajput, Principal Product Manager at Amazon AGI (Artificial General Intelligence)
  • Discussed how to use data beyond text (images, documents, videos, and audio) to build accurate, context-aware applications with Amazon Nova foundation models.
  • Joined by Brandon Nyer, Senior Product Manager, and Tyianne of Box, who covered image and video understanding and customer use cases.

Enterprise Needs and Challenges

  • Organizations have vast amounts of multimodal data (text, structured data, contracts, videos, call recordings) but only use a small portion, mostly text or structured data.
  • Key challenges with using multimodal data:
    1. Separate models and tools for each modality, leading to complexity and lack of context integration.
    2. Difficulty in reasoning across modalities to deliver customer insights.
    3. Inaccurate models that require human review, which doesn't scale.

Amazon Nova 2.0 Models

  • Designed to treat all modalities as first-class citizens, with native multimodal processing capabilities.
  • Variety of models to cater to different cost, latency, and accuracy profiles:
    • Nova 2 Light: Fast, cost-effective reasoning model
    • Nova 2 Pro: Most intelligent model for complex tasks
    • Nova 2 Omni: Unified model for understanding and generation
    • Nova 2 Sonic: Conversational speech-to-text model
    • Nova Multimodal Embeddings: Cross-modal search and retrieval
  • Key features:
    • 1-million-token context window for processing long-form content
    • Multilingual support (200+ languages, 10+ for speech)
    • Integrated reasoning capabilities
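As a concrete illustration of native multimodal input, the sketch below builds a Bedrock Converse-API request that mixes text and an image in a single message. The model ID is a hypothetical placeholder (check the Bedrock console for real identifiers), and the image bytes are a stand-in; the actual network call is shown commented out since it needs AWS credentials.

```python
# Sketch: mixed text + image input for a Nova model through the Amazon
# Bedrock Converse API. MODEL_ID below is an assumption for illustration,
# not a confirmed identifier.

MODEL_ID = "us.amazon.nova-2-lite-v1:0"  # hypothetical model ID

def build_multimodal_message(prompt: str, image_bytes: bytes) -> dict:
    """Build a Converse-API user message containing text and one image."""
    return {
        "role": "user",
        "content": [
            {"text": prompt},
            {"image": {"format": "png", "source": {"bytes": image_bytes}}},
        ],
    }

message = build_multimodal_message(
    "List the objects visible in this living-room photo.",
    image_bytes=b"<png bytes here>",  # placeholder, not a real image
)

# With credentials configured, the call would look like:
# import boto3
# client = boto3.client("bedrock-runtime")
# response = client.converse(modelId=MODEL_ID, messages=[message])
# print(response["output"]["message"]["content"][0]["text"])
```

The same message shape extends to documents and video by swapping the content block type, which is what lets one model treat all modalities uniformly.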

Document Intelligence

  • Optimized for two key primitives: Optical Character Recognition (OCR) and Key Information Extraction (KIE).
  • OCR optimizations:
    1. Robust real-world OCR for challenging documents (handwritten, low-quality scans, tilted)
    2. Mixed context understanding (text, charts, tables)
    3. Structured output (JSON, HTML, XML)
  • KIE optimizations:
    1. Schema-driven extraction
    2. Layout-aware text extraction
    3. Integrated reasoning to validate extracted data
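Schema-driven extraction can be sketched as follows: hand the model a JSON schema describing the fields you want, so the response is structured JSON rather than free text. The invoice schema and prompt wording here are illustrative assumptions, not an official Nova prompt format.

```python
import json

# Sketch: schema-driven key information extraction (KIE). The schema and
# prompt below are illustrative; the real document bytes would accompany
# this prompt in the model request.

INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "issue_date": {"type": "string", "description": "ISO 8601 date"},
        "total_amount": {"type": "number"},
        "currency": {"type": "string"},
    },
    "required": ["invoice_number", "total_amount"],
}

def build_kie_prompt(schema: dict) -> str:
    """Ask the model to return only JSON conforming to the given schema."""
    return (
        "Extract the following fields from the attached document and "
        "return ONLY valid JSON conforming to this schema:\n"
        + json.dumps(schema, indent=2)
    )

prompt = build_kie_prompt(INVOICE_SCHEMA)
```

Validating the returned JSON against the same schema on the client side pairs naturally with the model's integrated reasoning checks.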

Image and Video Understanding

  • Vision Perception: Identifying objects, attributes, and spatial relationships in images.
    • Example: Detecting plants, cushions, table, and TV in a living room image.
  • Reasoning and Scene Semantics: Leveraging reasoning capabilities to understand context and make logical deductions.
  • Temporal Understanding: Analyzing video content to identify events, generate captions, and extract timestamps.
    • Example: Identifying timestamps when someone is standing on a boat in a video.
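The boat-timestamp example above might be expressed as a Converse-API request with a video content block, sketched below. The model ID is hypothetical and the video bytes are a placeholder; the request shape follows the Converse API's documented content-block format.

```python
# Sketch: asking a Nova model for event timestamps in a video via the
# Bedrock Converse API. MODEL_ID is a hypothetical placeholder.

MODEL_ID = "us.amazon.nova-2-pro-v1:0"  # hypothetical model ID

def build_video_query(question: str, video_bytes: bytes) -> dict:
    """Build a Converse-API user message containing text and one video."""
    return {
        "role": "user",
        "content": [
            {"text": question},
            {"video": {"format": "mp4", "source": {"bytes": video_bytes}}},
        ],
    }

msg = build_video_query(
    "Return every timestamp (mm:ss) where a person is standing on the boat.",
    video_bytes=b"<mp4 bytes>",  # placeholder, not a real video
)

# client.converse(modelId=MODEL_ID, messages=[msg]) would return the
# model's text answer containing the requested timestamps.
```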

Amazon Nova Multimodal Embeddings

  • Unified model that generates embeddings for text, images, documents, videos, and audio.
  • Represents all modalities in the same embedding space, enabling cross-modal applications.
  • Key features:
    • Unmatched modality coverage
    • Long context length (8,000 tokens)
    • Segmentation capabilities
    • Synchronous and asynchronous APIs
    • Choice of embedding dimensions for cost-accuracy tradeoff
  • Benchmarks show strong performance on video retrieval, visual document search, and text-based tasks.
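Because all modalities land in one embedding space, a text query vector can be compared directly against vectors from videos, documents, or images. The toy sketch below shows that retrieval loop with tiny mock vectors; real embeddings would come from the Nova Multimodal Embeddings API and have far higher dimensionality.

```python
import math

# Sketch: cross-modal retrieval on a shared embedding space. The vectors
# are mock 3-dimensional stand-ins for API-generated embeddings.

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# A toy index mixing modalities: video, document, and image embeddings.
index = {
    "boat_clip.mp4": [0.9, 0.1, 0.2],
    "contract.pdf": [0.1, 0.8, 0.3],
    "sunset.png": [0.2, 0.2, 0.9],
}

# Mock embedding of the text query "person on a boat".
query_vec = [0.85, 0.15, 0.25]

best = max(index, key=lambda k: cosine_similarity(query_vec, index[k]))
print(best)  # prints "boat_clip.mp4": the video is nearest to the query
```

In production this nearest-neighbor step would run in a vector database; the point is that one query embedding searches every modality at once.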

Box Use Cases

  • Box is the leading intelligent content management company, serving over 115,000 organizations.
  • Challenges in accessing unstructured data (PDFs, documents, videos, etc.) and extracting insights.
  • Previous approaches:
    1. Text extraction and embedding
    2. Human annotation
  • Limitations: Lack of context, scalability issues, and slow processing.
  • Amazon Nova Multimodal Embeddings unlocks new use cases:
    1. Generating instant insights from long, technical documents
    2. Automating workflows by extracting actionable information
    3. Enabling continuity checks for media production by searching video content

Conclusion and Next Steps

  • Box is actively testing and integrating the new Amazon Nova 2.0 models.
  • Exploring new use cases enabled by the multimodal understanding capabilities.
  • Focused on scaling the deployment of these models in production environments to benefit customers.
