Amazon has built an artificial intelligence system that can teach voice assistants such as Alexa a new speaking style within hours. One example is a style that resembles the delivery of a newsreader.

According to Trevor Wood, Amazon’s applied science manager, the new text-to-speech system can replace traditional speech training methods. Those methods often require voice actors to record tens of hours of speech in the desired style to train the models.

Wood explains that synthetic speech produced by neural networks sounds much more natural to users than speech produced by traditional methods, in which short voice fragments stored in an audio database are linked together. Because the new system is more flexible, Amazon can also change the style of the synthetic speech easily.


Amazon calls the new model “neural text-to-speech”, or NTTS. According to the company, it has two important components. The first is a “generative neural network”, which converts a series of phonemes (the units of sound that distinguish one word from another) into a series of spectrograms. These are visual representations of the frequency spectrum of those sounds as it changes over time. The spectrograms must emphasize “features that the human brain uses to process speech,” says Wood.
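To make the idea of a spectrogram concrete, here is a minimal sketch in plain Python. It is not Amazon's code and says nothing about how NTTS computes its spectrograms; the function name `spectrogram`, the frame length, the hop size and the toy two-tone signal are all assumptions chosen for illustration. It slices a signal into short overlapping frames and measures how much energy each frequency carries in each frame.

```python
import cmath
import math


def spectrogram(signal, frame_len=64, hop=32):
    """Return a list of frames; each frame holds one magnitude per frequency bin.

    Illustrative helper only (hypothetical name and defaults).
    """
    # Hann window: tapers each frame so its edges do not smear across bins.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        chunk = [signal[start + n] * window[n] for n in range(frame_len)]
        # Naive discrete Fourier transform, non-negative frequencies only.
        mags = []
        for k in range(frame_len // 2 + 1):
            acc = sum(chunk[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            mags.append(abs(acc))
        frames.append(mags)
    return frames


def tone(freq_hz, n, sample_rate):
    """Sample n of a sine wave at freq_hz."""
    return math.sin(2 * math.pi * freq_hz * n / sample_rate)


# Toy signal (assumption): a 1 kHz tone followed by a 3 kHz tone at 8 kHz.
sample_rate = 8000
signal = ([tone(1000, n, sample_rate) for n in range(512)]
          + [tone(3000, n, sample_rate) for n in range(512)])

spec = spectrogram(signal)
bin_hz = sample_rate / 64  # width of one frequency bin for frame_len=64

# Strongest frequency in the first and in the last frame.
first_peak = max(range(len(spec[0])), key=lambda k: spec[0][k]) * bin_hz
last_peak = max(range(len(spec[-1])), key=lambda k: spec[-1][k]) * bin_hz
print(first_peak, last_peak)  # prints "1000.0 3000.0"
```

The strongest frequency moves from 1 kHz in the first frame to 3 kHz in the last, which is the "spectrum changing over time" that a spectrogram records. Production systems would typically use an FFT library rather than this slow, naive transform.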

The second component is a “vocoder”, which converts those spectrograms into a continuous audio signal that is used to train the text-to-speech model. The new training method can combine neural text-to-speech data with a few hours of additional data to create a model that can pick out the speech elements unique to a specific speaking style.

According to Wood, Amazon’s research shows that listeners strongly prefer the voices produced by NTTS. In fact, they rated the method almost as highly as natural human speech.

