Microsoft AI generates realistic speech based on 200 audio clips


Microsoft scientists have described an Artificial Intelligence (AI) system that uses largely unsupervised learning to achieve 99.84 percent word intelligibility accuracy, as well as an 11.7 percent phoneme error rate (PER) for automatic speech recognition. The model used only 200 audio clips and their associated transcripts for its supervised training data.

Unsupervised learning is a branch of machine learning that derives knowledge from unlabelled, unclassified and uncategorized training data. The scientists were able to develop their AI system with this technique thanks to Transformers, according to VentureBeat.

Transformers are a type of neural architecture that was introduced in 2017 in a paper by scientists from Google Brain. Transformers, like all other deep neural networks, contain neurons: mathematical functions loosely modelled on biological neurons. Those neurons are arranged in interconnected layers that pass signals derived from the input data forward. During training, the network gradually adjusts the synaptic strength – the weights – of each connection.

What makes Transformers unique is that every output element is connected to every input element, and the weights between them are calculated dynamically.
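This dynamic weighting is the attention mechanism at the heart of the Transformer. The following minimal NumPy sketch illustrates the idea (scaled dot-product attention, as in the 2017 Google Brain paper); the matrix shapes and variable names are illustrative, not taken from Microsoft's model:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Every output position attends to every input position;
    the connection weights are computed dynamically from the data."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # similarity of each query to each key
    weights = softmax(scores, axis=-1)   # dynamic weights, one row per output
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # 4 output positions
K = rng.normal(size=(6, 8))   # 6 input positions
V = rng.normal(size=(6, 8))   # values carried from the input positions
out, w = scaled_dot_product_attention(Q, K, V)
# w has shape (4, 6): each output element is connected to each input element,
# and each row of w sums to 1.
```

Unlike a fixed weight matrix, `w` is recomputed for every new input, which is what "calculated dynamically" refers to.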

Training and results

Microsoft scientists built their AI system around a Transformer component that can take either speech or text as input and produce either as output. For training data, they used the publicly available LJSpeech dataset, which contains 13,100 English audio snippets with transcripts. From these, the team randomly chose two hundred clips to form the supervised training set. They also used denoising auto-encoder components to reconstruct corrupted speech and text.
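The data split described above can be sketched in a few lines of Python. This is a hypothetical illustration of the idea – 200 randomly chosen clip/transcript pairs kept as supervised data, the rest decoupled into unpaired pools – not code from the paper:

```python
import random

def split_ljspeech(pairs, n_supervised=200, seed=42):
    """Randomly pick n_supervised (clip, transcript) pairs as supervised
    data; decouple the remaining clips and transcripts into unpaired pools."""
    rng = random.Random(seed)
    shuffled = pairs[:]
    rng.shuffle(shuffled)
    supervised = shuffled[:n_supervised]
    rest = shuffled[n_supervised:]
    # Unpaired pools: the model never sees which transcript belongs
    # to which recording in this portion of the corpus.
    audio_only = [audio for audio, _ in rest]
    text_only = [text for _, text in rest]
    return supervised, audio_only, text_only

# LJSpeech holds 13,100 clip/transcript pairs (placeholder names here).
corpus = [(f"clip_{i}.wav", f"transcript {i}") for i in range(13100)]
paired, audio_pool, text_pool = split_ljspeech(corpus)
# len(paired) == 200; the 12,900 remaining clips feed the unpaired pools.
```

Training on the small paired set plus the large unpaired pools is what makes the approach "almost unsupervised".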

The results were quite good, especially given the small amount of supervised training data. The researchers claim the system outperformed three baseline algorithms in their tests. According to VentureBeat, several of the generated samples also sound convincingly human.

The researchers want to push the limits of unsupervised learning further by training on entirely unpaired speech and text data, using other pre-training methods. In their paper, they describe their approach as an almost unsupervised method for text to speech and automatic speech recognition, one that requires only a small amount of paired speech and text data alongside additional unpaired data.

This news article was automatically translated from Dutch to give a head start. All news articles after September 1, 2019 are written in native English and NOT translated. All our background stories are written in native English as well. For more information read our launch article.