Text-to-Speech (TTS) Software: A Simple Guide

Amazon Polly includes the following text-to-speech features, which are essential for developing modern speech applications.

Range of voices

With the ability to select different languages, regions, genders, and voices within a region, you have a more complete toolkit for development. Amazon Polly supports dozens of languages, along with national variants and accents, in both male and female voices.
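As a rough illustration of choosing a voice by language, region, and gender, the sketch below filters the voice list returned by Polly's `describe_voices` operation. The helper name `list_voices` is hypothetical, and the function assumes it is handed a Polly client object (for example, one created with `boto3.client("polly")`); it ignores pagination for brevity.

```python
from typing import Any, List, Optional


def list_voices(polly: Any, language_code: str, gender: Optional[str] = None) -> List[str]:
    """Return sorted voice IDs for one language/region, optionally filtered by gender.

    `polly` is assumed to be a Polly client; `list_voices` itself is a
    hypothetical helper written for this guide.
    """
    # LanguageCode narrows the result to one language and regional variant,
    # for example "en-GB" or "es-MX".
    response = polly.describe_voices(LanguageCode=language_code)
    voices = response["Voices"]
    if gender is not None:
        # Each voice entry carries a "Gender" field ("Female" or "Male").
        voices = [v for v in voices if v["Gender"] == gender]
    return sorted(v["Id"] for v in voices)
```

With real credentials configured, a call such as `list_voices(boto3.client("polly"), "en-GB", gender="Female")` would return the matching voice IDs for British English.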

API-based integration

Verify that your text-to-speech software has a fully functional API available in multiple programming languages, in order to have the widest range of integrations possible across different projects. Amazon Polly offers the Amazon Polly API and several language-specific Software Development Kits (SDKs). You can also access it from the AWS Management Console and the AWS Command Line Interface (CLI). You have complete control over all features of Amazon Polly, no matter how you use it.
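To make the API route more concrete, here is a minimal sketch of a synthesis call through the Python SDK. It assumes a Polly client is passed in (for example, `boto3.client("polly")`); the helper name `synthesize_to_file` and the default voice `Joanna` are illustrative choices, not requirements.

```python
from typing import Any


def synthesize_to_file(polly: Any, text: str, path: str, voice_id: str = "Joanna") -> int:
    """Synthesize `text` as MP3, save it to `path`, and return the bytes written.

    `polly` is assumed to be a Polly client; the helper itself is hypothetical.
    """
    response = polly.synthesize_speech(
        Text=text,
        OutputFormat="mp3",
        VoiceId=voice_id,
    )
    # AudioStream is a file-like streaming body: read it once, then save it.
    audio = response["AudioStream"].read()
    with open(path, "wb") as out:
        out.write(audio)
    return len(audio)
```

Passing the client in as an argument keeps the function easy to test with a stand-in object, and the same call shape is available from the other language SDKs and the CLI.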

Precise voice control

Speech Synthesis Markup Language (SSML) is an XML-based markup language that allows you to provide additional information about how speech output should sound. For example, you can specify pauses, how to interpret items such as dates and acronyms, pitch, speed, volume, emphasis, fade, and other audio elements to customize the generated voice. SSML allows you to fully control speech output and to carry your customizations over to other systems.

Amazon Polly supports both standard SSML tags and Amazon-specific extensions, such as a tag that makes a voice sound like a newscaster. Thanks to this level of flexibility, you can create extremely lifelike voices that capture and hold users’ attention.
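As an example of what SSML markup looks like in practice, the sketch below assembles a small document with a pause, a spelled-out acronym, and a slower closing phrase. The helper name `build_ssml` and the sample text are made up for illustration; the tags used (`break`, `say-as`, `prosody`) are standard SSML.

```python
from xml.sax.saxutils import escape


def build_ssml(intro: str, acronym: str, closing: str) -> str:
    """Compose an SSML document: intro, half-second pause, spelled-out acronym, slow closing."""
    return (
        "<speak>"
        # escape() protects characters such as & and < that would break the XML.
        f"{escape(intro)}"
        '<break time="500ms"/>'
        f'<say-as interpret-as="spell-out">{escape(acronym)}</say-as>'
        f'<prosody rate="slow">{escape(closing)}</prosody>'
        "</speak>"
    )


ssml = build_ssml("Welcome to R&D.", "TTS", "Thanks for listening.")
print(ssml)
```

To have Polly interpret the string as markup rather than plain text, the synthesis request would set `TextType="ssml"` alongside the text.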

Metadata hooks for synchronized animations

Some applications, such as games and multimedia, require character animations to follow the audio, including mouth movements or karaoke-style text displayed on the screen. Multilingual training videos can also benefit from timing synchronization, so that the audio stays aligned with the video in every language.

For these kinds of applications, developers need timestamped metadata that marks when each element of speech occurs. Amazon Polly allows you to request this additional metadata, called speech marks, along with the audio file. Speech marks provide information such as each element's time offset within the audio, the viseme (the position of the face and mouth when a word is spoken), and other details that connect the written text to the speech output.
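To show how such metadata might drive an animation, the sketch below parses speech marks, which Polly returns as one JSON object per line when a request uses `OutputFormat="json"` together with `SpeechMarkTypes`. The sample payload is hand-written for illustration, not captured output, and the helper names are hypothetical.

```python
import json
from typing import Any, Dict, List, Tuple


def parse_speech_marks(payload: str) -> List[Dict[str, Any]]:
    """Parse line-delimited JSON speech marks into a list of dicts."""
    return [json.loads(line) for line in payload.splitlines() if line.strip()]


def marks_of_type(payload: str, mark_type: str) -> List[Tuple[int, str]]:
    """Return (time in ms from the start of the audio, value) for one mark type."""
    return [
        (mark["time"], mark["value"])
        for mark in parse_speech_marks(payload)
        if mark["type"] == mark_type
    ]


# Illustrative sample only: word marks carry text offsets, viseme marks do not.
sample = "\n".join([
    '{"time":0,"type":"sentence","start":0,"end":23,"value":"Mary had a little lamb."}',
    '{"time":6,"type":"word","start":0,"end":4,"value":"Mary"}',
    '{"time":6,"type":"viseme","value":"p"}',
    '{"time":73,"type":"viseme","value":"E"}',
])

print(marks_of_type(sample, "viseme"))  # [(6, 'p'), (73, 'E')]
```

An animation loop could walk the viseme list and switch mouth shapes as playback reaches each time offset, while the word marks could highlight karaoke-style text.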

Customization

Text-to-speech software must be fully customizable to ensure maximum flexibility. For example, audio output must be customizable for different formats and configurations, including file type, file size, and data quality. The software must be able to handle custom vocabulary that is outside of the training data.

Amazon Polly supports text-to-speech customization at every stage.

Vocabulary

You can create a custom dictionary, called a lexicon, with custom pronunciations for company names, acronyms, foreign words, and neologisms.

Output format

You can request output in several audio formats, such as MP3, Ogg Vorbis, and raw PCM. Amazon Polly also supports long-form audio, such as reading documents aloud, in a natural-sounding voice. You can generate continuous audio streams for low-bandwidth connections or for low-latency, real-time use cases.
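If you need a WAV file, one approach is to request Polly's `pcm` output format, which returns headerless 16-bit mono samples, and add the WAV header yourself. The sketch below does this with Python's standard `wave` module; the helper name `pcm_to_wav` is hypothetical, and the 16 kHz default is an assumption that must match the sample rate used in the request.

```python
import io
import wave


def pcm_to_wav(pcm: bytes, sample_rate: int = 16000) -> bytes:
    """Wrap raw 16-bit little-endian mono PCM samples in a WAV container."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)            # mono
        wav.setsampwidth(2)            # 2 bytes per sample = 16-bit audio
        wav.setframerate(sample_rate)  # must match the rate the audio was generated at
        wav.writeframes(pcm)
    # Leaving the with-block finalizes the header; the buffer itself stays open.
    return buffer.getvalue()
```

The resulting bytes can be written straight to a `.wav` file or handed to any player that expects the RIFF/WAVE format.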

Voice

Amazon Polly also offers Brand Voice, a custom engagement in which you work with the Amazon Polly team to build a voice exclusively for your organization. Rather than sounding like every other app, you can stand out with a voice that is unique to your brand.
