Document narrator using Amazon Polly

Introduction

When the first Iron Man movie was released, we met Jarvis, Iron Man’s talking AI. It fascinated me, and I have wondered ever since about programs that could talk like a human being. Thanks to advances in deep learning, that piece of science fiction is now a reality, and it lets you add a conversational user experience to your application. There is a practical side too: if you receive various documents daily and need to go through them as part of your job, isn’t it better to have someone read them to you? Or maybe you are lazy like me and want someone to simplify the job.

You could build and train your own deep learning model that synthesizes speech from text, but isn’t that too much? What if I tell you there is already a service from AWS that lets you create human-like speech from a text file, and that AWS provides a free tier for it? Let me formally introduce Amazon Polly, a text-to-speech service from AWS. I am going to use Polly to create a document reader, and I will use AWS CDK to build the infrastructure. A little knowledge of CDK will help, but it is not required. If you want to know more about CDK, you can check out our video on this topic.

What is Amazon Polly?

Amazon Polly is a text-to-speech service from AWS. It turns text into lifelike speech, allowing you to create applications that narrate your text and to build entirely new categories of speech-enabled products. Polly uses advanced deep-learning techniques to synthesize human-sounding speech. It supports a broad set of widely spoken languages, with different voices to match the nuances of each one, and many of those languages offer both male and female voices.

Why use Amazon Polly?

Natural-sounding voices - Amazon Polly covers a broad set of languages and offers a wide selection of male and female voices. Its pronunciation is fluid, which lets us synthesize very natural-sounding speech.
Real-time streaming - For a speech-enabled conversational user experience, the conversion of text to speech has to be consistent and fast. When we send text to Polly’s API, it streams the audio back with low latency.
Store & redistribute speech - Using Amazon Polly, we can create audio in a standard format such as MP3 or Ogg Vorbis. Once created, you can store and replay the audio file any number of times without additional per-use fees from Polly.
Customize & control speech output - Amazon Polly supports lexicons and SSML tags, which let you manage aspects of speech such as pronunciation, volume, pitch, and speaking rate.
Low cost - Amazon Polly follows a pay-as-you-go model. It charges you only for the number of characters it converts.
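To make the "customize & control" point a little more concrete, here is a minimal sketch of wrapping plain text in SSML before sending it to Polly. The helper names build_ssml and synthesize_ssml are my own and not part of any library. The second function assumes you pass in a boto3 Polly client, and it is shown only to illustrate the TextType="ssml" parameter.

```python
from xml.sax.saxutils import escape


def build_ssml(text, rate="medium", volume="medium"):
    """Wrap plain text in SSML <speak> and <prosody> tags (hypothetical helper)."""
    # Escape &, < and > so the document text cannot break the markup.
    body = escape(text)
    return (
        f'<speak><prosody rate="{rate}" volume="{volume}">'
        f"{body}</prosody></speak>"
    )


def synthesize_ssml(polly_client, ssml, voice_id="Joanna"):
    """Send SSML to Polly; TextType="ssml" makes Polly interpret the tags."""
    return polly_client.synthesize_speech(
        Text=ssml,
        TextType="ssml",
        OutputFormat="mp3",
        VoiceId=voice_id,
    )


print(build_ssml("Terms & conditions apply.", rate="slow", volume="loud"))
```

The prosody values used here (slow, loud) are among the named levels that SSML defines. Which attributes a given voice honors can depend on the engine, so treat this as a starting point rather than a complete SSML guide.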

Walkthrough of the Console

Log in to your AWS console, then search for Amazon Polly to reach its home page. Once there, you can select the engine type (neural or standard), your preferred language, and a voice, and write your text in the input text box. Then you can click the Listen button to synthesize the text into speech and play it, download the audio file by clicking the Download button, or store it in an S3 bucket.

Pricing

Amazon Polly follows a pay-as-you-go pricing model. You are billed monthly based on the number of text characters you process. Polly charges $4.00 per million characters for standard voices and $16.00 per million characters for neural voices. There is also a free tier for the first 12 months, starting from your first speech request. It covers 5 million characters per month for standard voices and 1 million characters per month for neural voices.[2]
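As a rough illustration of the arithmetic, the sketch below estimates a monthly bill from a character count. The function name and structure are my own, and the prices and free-tier allowances are simply the figures quoted above, so check the current pricing page before relying on them.

```python
# Price in US dollars per one million characters, as quoted above.
PRICE_PER_MILLION = {"standard": 4.00, "neural": 16.00}
# Monthly free-tier allowance in characters (first 12 months only).
FREE_TIER_CHARS = {"standard": 5_000_000, "neural": 1_000_000}


def estimate_monthly_cost(characters, engine="standard", free_tier=False):
    """Return an estimated monthly Polly bill in US dollars (hypothetical helper)."""
    if free_tier:
        # Only the characters above the free allowance are billed.
        characters = max(0, characters - FREE_TIER_CHARS[engine])
    return characters / 1_000_000 * PRICE_PER_MILLION[engine]


# Two million characters with a standard voice, outside the free tier.
print(estimate_monthly_cost(2_000_000))
# The same volume while the free tier still applies.
print(estimate_monthly_cost(2_000_000, free_tier=True))
```

For a document narrator that processes a handful of reports a day, the free tier is likely to cover everything during the first year.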

An example application using CDK

Now let us create a document narrator using Amazon Polly. I am using CDK as my IaC tool and Python as my language for this project.

To run this project, you will need CDK, Node.js, and Python installed on your system.

If you don’t have Node.js, install it from the official Node.js site. If you don’t have CDK, follow this instruction to install it. Similarly, if you don’t have Python, install it from the official Python site.

To start, I will create an empty directory and make it my working directory.


mkdir doc_narrator
cd doc_narrator

Then create a CDK project using cdk init. This generates an empty CDK project.


cdk init app --language=python

After the project is created, activate the virtual environment (source .venv/bin/activate on macOS or Linux) and install the dependencies with pip install -r requirements.txt.

For this project, I will need an S3 bucket to store the text file and a Lambda function to call Amazon Polly’s API to synthesize speech. Optionally, you can also have an SNS topic; I will come back to it later.

For this project, we need to import the following constructs.


from aws_cdk import Stack
import aws_cdk
from constructs import Construct
from aws_cdk import aws_s3
from aws_cdk import aws_lambda
from aws_cdk import aws_lambda_event_sources
from aws_cdk import aws_iam
from aws_cdk import aws_sns, aws_sns_subscriptions

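Add the following S3 bucket construct to your stack class’s __init__ method. Note that self.STACKPREFIX is not provided by CDK; it is assumed here to be a string attribute defined on the stack class and used as a prefix for resource names.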


self.file_bucket = aws_s3.Bucket(self,
            id=f"{self.STACKPREFIX}-command-bucket",
            removal_policy=aws_cdk.RemovalPolicy.DESTROY,
            auto_delete_objects=True
        )

In this application, whenever you add a text file to the S3 bucket, it will trigger the Lambda function to create the corresponding audio file.
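To make the trigger concrete, here is a trimmed-down sketch of the event that the Lambda function receives from S3 and of how the bucket and key can be read from it. The bucket name, the object key, and the helper name extract_location are made up for illustration. Real events carry many more fields.

```python
import urllib.parse

# A trimmed-down sample of an S3 "object created" event.
# The bucket name and key are invented for this example.
sample_event = {
    "Records": [
        {
            "eventName": "ObjectCreated:Put",
            "s3": {
                "bucket": {"name": "my-narrator-bucket"},
                "object": {"key": "files/meeting+notes.txt", "size": 1024},
            },
        }
    ]
}


def extract_location(event):
    """Return (bucket, key) for the first record of an S3 event."""
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    # Keys arrive URL-encoded: a space in the file name shows up as "+".
    key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
    return bucket, key


print(extract_location(sample_event))
```

Decoding the key matters as soon as a file name contains spaces or other special characters, because get_object expects the real key rather than the encoded one.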

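Let’s create the SNS topic and its subscription. Polly will publish a notification to this topic when the asynchronous synthesis task finishes, and the email subscription forwards that notification to you.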

self.sns_topic = aws_sns.Topic(self,
            id=f"{self.STACKPREFIX}-sns-long-audio-topic",
            display_name=f"{self.STACKPREFIX}-async-synth-topic",
        )
self.sns_topic.add_subscription(subscription=aws_sns_subscriptions.EmailSubscription(email_address='<your email address>'))

Let’s add the Lambda construct.


self.text_processor_lambda = aws_lambda.Function(self,
            id=f"{self.STACKPREFIX}-text-processor",
            description="This Lambda function gets a text file and converts the text to speech",
            # Python 3.8 is no longer supported on Lambda, so use a current runtime.
            runtime=aws_lambda.Runtime.PYTHON_3_12,
            # Package the contents of the local src directory as the function code.
            code=aws_lambda.Code.from_asset('src'),
            handler='main.handler',
            environment={
                "OUTPUTBUCKET": self.file_bucket.bucket_name,
                "SNSTOPICARN": self.sns_topic.topic_arn
            }
        )

Next, we need to set the trigger for the Lambda function.


self.text_processor_lambda.add_event_source(
            source=aws_lambda_event_sources.S3EventSource(
                bucket=self.file_bucket,
                events=[aws_s3.EventType.OBJECT_CREATED],
                filters=[aws_s3.NotificationKeyFilter(prefix='files/')]
            )
        )

As you can see, I have set a notification key filter with the prefix ‘files/’. Now, whenever you put a text file under the files/ prefix (the "folder") of the S3 bucket, it will trigger the Lambda function. Next, we need to add IAM permissions to the Lambda function’s role so that it can get the object from the S3 bucket and call Amazon Polly’s APIs.

self.text_processor_lambda.add_to_role_policy(
            statement=aws_iam.PolicyStatement(
                actions=["s3:GetObject", "s3:PutObject"],
                resources=[f"{self.file_bucket.bucket_arn}/*"]
            )
        )
self.text_processor_lambda.add_to_role_policy(
            statement=aws_iam.PolicyStatement(
                actions=[
                    "polly:SynthesizeSpeech",
                    "polly:StartSpeechSynthesisTask",
                    "polly:GetSpeechSynthesisTask",
                    "polly:ListSpeechSynthesisTasks"
                ],
                resources=["*"]
            )
        )
self.text_processor_lambda.add_to_role_policy(
            statement=aws_iam.PolicyStatement(
                actions=['sns:Publish'],
                resources=[self.sns_topic.topic_arn]
            )
        )

The role needs s3:GetObject to download the text file from the bucket and s3:PutObject so that the audio file can be written back to it. To call Amazon Polly’s API, it needs polly:SynthesizeSpeech, and to create and track an asynchronous synthesis task it needs polly:StartSpeechSynthesisTask, polly:GetSpeechSynthesisTask, and polly:ListSpeechSynthesisTasks.[1] Finally, sns:Publish on the topic allows the task’s completion notification to be published.

Let’s create a directory called src. Inside it, create a file called main.py, which holds the code that gets the text file from the S3 bucket and creates the audio file.

There are two ways to synthesize speech from text. The first is synthesize_speech, which returns the audio in near real time with relatively low latency. In exchange, a single call can synthesize at most 3,000 billed characters. A document will usually be longer than that, so I will create an asynchronous speech synthesis task with start_speech_synthesis_task instead, which accepts up to 100,000 billed characters per task.[1]
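To illustrate the difference between the two routes, here is a small sketch that picks an operation based on text length and shows what the synchronous call looks like. The helper names are my own, and the length check is an approximation: it counts plain characters, whereas Polly counts billed characters. The second function assumes you pass in a boto3 Polly client.

```python
SYNC_CHAR_LIMIT = 3000  # limit quoted above for synthesize_speech


def pick_operation(text):
    """Return the name of the Polly operation suited to the text length."""
    if len(text) <= SYNC_CHAR_LIMIT:
        return "synthesize_speech"
    return "start_speech_synthesis_task"


def synthesize_short_text(polly_client, text, voice_id="Joanna"):
    """Synchronous route: the MP3 bytes come back in the response itself."""
    response = polly_client.synthesize_speech(
        Text=text,
        OutputFormat="mp3",
        VoiceId=voice_id,
    )
    # AudioStream is a streaming body; read() returns the audio bytes.
    return response["AudioStream"].read()


print(pick_operation("A short sentence."))
print(pick_operation("x" * 5000))
```

Documents are usually longer than the synchronous limit, so the handler in main.py always takes the asynchronous route.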


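import os
import urllib.parse

import boto3


def handler(event, context):
    try:
        # Locate the uploaded text file from the S3 event record.
        record = event['Records'][0]
        bucket_name = record['s3']['bucket']['name']
        # Object keys are URL-encoded in S3 event notifications.
        key = urllib.parse.unquote_plus(record['s3']['object']['key'])

        s3_client = boto3.client('s3')
        s3_object = s3_client.get_object(
            Bucket=bucket_name,
            Key=key
        )
        data = s3_object['Body'].read().decode('utf-8')

        polly_client = boto3.client('polly')

        output_bucket_name = os.environ.get('OUTPUTBUCKET')
        sns_topic_arn = os.environ.get('SNSTOPICARN')

        # Start an asynchronous task. Polly writes the MP3 under audio/
        # and notifies the SNS topic when the task finishes.
        response = polly_client.start_speech_synthesis_task(
            Engine='standard',
            Text=data,
            OutputFormat="mp3",
            VoiceId="Joanna",
            SampleRate="16000",
            OutputS3BucketName=output_bucket_name,
            OutputS3KeyPrefix='audio/',
            SnsTopicArn=sns_topic_arn
        )
        print(f"Started synthesis task {response['SynthesisTask']['TaskId']}")
    except Exception as e:
        # Log the error to CloudWatch Logs and re-raise so the invocation is marked as failed.
        print(e)
        raise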

Here, I have used 16000 Hz as a moderate sample rate, but you can use a higher value, such as 22050 or 24000, to get a better-quality audio file.
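Because the task runs asynchronously, the audio file is not ready when the handler returns. The SNS email tells you when it is done, but you can also check the task yourself. Below is a minimal sketch of polling with get_speech_synthesis_task, which the IAM policy above already allows. The function name wait_for_task and its polling parameters are my own, and the task ID is assumed to come from response['SynthesisTask']['TaskId'] in the handler.

```python
import time


def wait_for_task(polly_client, task_id, poll_seconds=5, max_polls=60):
    """Poll a synthesis task until it completes or fails.

    Returns the location of the audio file reported by Polly on success.
    """
    for _ in range(max_polls):
        result = polly_client.get_speech_synthesis_task(TaskId=task_id)
        task = result["SynthesisTask"]
        # TaskStatus is one of: scheduled, inProgress, completed, failed.
        status = task["TaskStatus"]
        if status == "completed":
            return task["OutputUri"]
        if status == "failed":
            raise RuntimeError(task.get("TaskStatusReason", "synthesis failed"))
        time.sleep(poll_seconds)
    raise TimeoutError(f"task {task_id} did not finish in time")
```

You would call it as wait_for_task(boto3.client('polly'), task_id) from a script on your machine. Polling inside the Lambda function itself is usually not worth it, since you would pay for the time spent waiting.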

Conclusion

Amazon Polly is a great service that lets you add a conversational user experience to your application and build text-to-speech features with very little code. It is reasonably priced, easy to use, and produces audio in near real time for short texts, while longer documents can be handled with asynchronous tasks. I highly recommend it if you want to add text-to-speech capabilities to your application.