
Amazon researchers recently published details of a new text-to-speech model called Base TTS. According to the researchers, the model produces more natural-sounding speech than previous neural networks.

The paper shows that Base TTS is the largest neural network to date in the text-to-speech category. The most advanced version of the model has about 1 billion parameters. In general, the more parameters a model has, the wider the range of tasks it is expected to handle.

The Base TTS model was trained on 100,000 hours of audio files publicly available on the Internet. About 90 per cent of these audio clips were in English.

Better quality pronunciation

The model, the researchers further point out, improves the pronunciation quality of words compared to previous text-to-speech models. An evaluation by linguists showed that Base TTS correctly pronounces, for example, the "@" sign and other symbols, as well as paralinguistic sounds such as "shh".

The model also read aloud English sentences containing foreign words and questions. It accomplished these tasks even though it was not explicitly trained on the types of sentences used in the evaluation dataset.

Two AI models

Amazon’s text-to-speech system consists of two separate AI models. The first model, built on the same Transformer architecture that also underlies GPT-4, transforms the input text into abstract mathematical representations, or "speech codecs". This model also compresses the speech codecs, making processing faster, and helps keep unwanted elements, such as background noise, out of the final Base TTS audio.
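To illustrate the idea, here is a minimal, hypothetical sketch of such a first stage: a tiny GPT-style (decoder-only Transformer) model that autoregressively predicts discrete speech-codec tokens from text tokens. All names, vocabulary sizes and layer counts below are illustrative assumptions, not Amazon's actual implementation; the real Base TTS model is vastly larger, at around 1 billion parameters.

```python
import torch
import torch.nn as nn

class TextToSpeechCodes(nn.Module):
    """Toy decoder-only Transformer: text tokens in, speech-codec tokens out."""

    def __init__(self, text_vocab=256, code_vocab=1024, d_model=128,
                 n_heads=4, n_layers=2, max_len=512):
        super().__init__()
        # Text and speech-codec tokens share one embedding table,
        # offset so the two vocabularies do not collide.
        self.embed = nn.Embedding(text_vocab + code_vocab, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        # The output head only ever predicts speech-codec tokens.
        self.head = nn.Linear(d_model, code_vocab)

    def forward(self, tokens):
        seq_len = tokens.size(1)
        positions = torch.arange(seq_len, device=tokens.device)
        x = self.embed(tokens) + self.pos(positions)
        # Causal mask: each position attends only to earlier positions,
        # which is what makes generation autoregressive (GPT-style).
        causal = torch.triu(torch.full((seq_len, seq_len), float('-inf'),
                                       device=tokens.device), diagonal=1)
        x = self.blocks(x, mask=causal)
        return self.head(x)  # logits over the speech-codec vocabulary

# Usage: feed text tokens, then read off the next predicted codec token.
model = TextToSpeechCodes()
text = torch.randint(0, 256, (1, 10))   # dummy text token ids
logits = model(text)                    # shape (1, 10, 1024)
next_code = logits[0, -1].argmax()      # greedy choice of next codec token
```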

The second neural network then converts these speech codecs into audio. It does so by turning the data into spectrograms: graphs used to visualize sound waves, which can easily be converted into AI-generated speech.
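The sketch below shows only the final step of that idea: turning a spectrogram back into a waveform. Base TTS uses its own trained decoder network for this; the classical Griffin-Lim algorithm stands in here as a simple substitute to demonstrate the general spectrogram-to-audio conversion. The sample rate, FFT size and file name are illustrative assumptions.

```python
import numpy as np
import librosa
import soundfile as sf

sr = 22050
# Stand-in for a model-generated magnitude spectrogram: we analyze a
# synthetic 440 Hz tone so the example is fully self-contained.
t = np.linspace(0, 1.0, sr, endpoint=False)
wave = 0.5 * np.sin(2 * np.pi * 440 * t)
spec = np.abs(librosa.stft(wave, n_fft=1024))   # magnitude spectrogram

# Griffin-Lim iteratively estimates the phase information that the
# magnitude spectrogram discarded, yielding a listenable waveform.
audio = librosa.griffinlim(spec, n_iter=32, n_fft=1024)
sf.write("reconstructed.wav", audio, sr)
```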
