Build multimodal ASL avatars with bidirectional translation (DEV306)

Introduction

The session discusses how to build multimodal American Sign Language (ASL) avatars with bi-directional translation capabilities.

The presenters are Alak Eswaradass (Principal Solutions Architect at AWS), Suresh Poopandi (Principal Solutions Architect at AWS), and Rob Koch (Principal Data Engineer at Slalom).

They aim to address the challenges faced by Deaf and hard of hearing users when accessing information and communicating with hearing people.

Sign Languages and Challenges

Sign languages, such as American Sign Language (ASL), are the primary language for Deaf and hard of hearing users.

Sign languages use hand gestures, body movements, and facial expressions to communicate.

Relying solely on captions or subtitles can be limiting as they may not capture the nuances and emotional aspects of the conversation.

There is a global shortage of sign language interpreters, which creates accessibility challenges.

Sign Language Avatars

Sign language avatars are AI-powered digital agents that can engage in conversations and provide sign language interpretation.

Two main use cases for sign language avatars:

Narrative avatars: Translate audio/video content into sign language in real-time.
Conversational avatars: Facilitate conversations between Deaf/hard of hearing users and hearing users.

Customization and inclusive communication are important aspects of the sign language avatars.

Technical Solution: GenASL

GenASL is a generative AI-powered application that enables visual communication for individuals who rely on it.

It has two main flows:

Sign language video generation: Converts English audio to ASL avatar video.
Video detection: Converts ASL video to English text and audio.

The solution leverages various multimodal AI models and services, such as Amazon Transcribe, Anthropic's Claude 3.5 Sonnet, and Meta's Llama 3.2 Vision Instruct.

The architecture follows a decoupled approach, allowing for customization and integration of different foundational models.

Use Cases and Customization

The sign language avatars can be applied in various industry sectors, such as healthcare, finance, media, and education.

Common approaches for customizing foundational models include prompt engineering, retrieval-augmented generation, fine-tuning, and continued pre-training.

The GenASL solution has leveraged fine-tuning techniques to adapt the models to the specific use cases.

Demonstrations

The presenters showcase four demo scenarios:

Narrative avatar: Translating a training video from audio to ASL avatar.
Narrative avatar: Translating a presentation from audio to ASL avatar.
Conversational avatar: Facilitating a check-in conversation at a hotel.
Conversational avatar: Assisting a customer in finding coffee machines at a retail store.

Architecture and Implementation

The solution's architecture is divided into two main parts:

ASL video generation flow:
- Converts English audio to text, then to ASL Gloss, and finally to smooth ASL avatar video.
Video detection flow:
- Converts ASL video to English text and then to English audio.

The presenters discuss the use of various models and services, such as Amazon Transcribe, Anthropic's Claude 3.5 Sonnet, Stable Diffusion, and Meta's Llama 3.2 Vision Instruct.

They also share best practices for integrating the solution, such as considerations for live streaming, leveraging Amplify Gen2, and utilizing Bedrock features.

Future Developments and Conclusion

Plans for future development include adding multilingual support and exploring the potential of unified multimodal models.

The presenters encourage attendees to explore the existing resources, such as the previous year's AWS ML blog and the upcoming Chalk Talk session, to learn more about the solution.

Build multimodal ASL avatars with bidirectional translation (DEV306)

Introduction

Sign Languages and Challenges

Sign Language Avatars

Technical Solution: GenASL

Use Cases and Customization

Demonstrations

Architecture and Implementation

Future Developments and Conclusion

Your Digital Journey deserves a great story.

Build one with us.

Headquarters

Delivery Centre

Build multimodal ASL avatars with bidirectional translation (DEV306)

Introduction

Sign Languages and Challenges

Sign Language Avatars

Technical Solution: GenASL

Use Cases and Customization

Demonstrations

Architecture and Implementation

Future Developments and Conclusion

Your Digital Journey deserves a great story.

Build one with us.

This website stores cookies on your computer.