Microsoft API generates realistic speech based on 200 audio clips

Microsoft scientists have described an artificial intelligence (AI) system that uses almost unsupervised learning to reach 99.84 percent word intelligibility for text to speech, as well as an 11.7 percent phoneme error rate (PER) for automatic speech recognition. The model was trained on only 200 audio clips and their associated transcripts.

Unsupervised learning is a branch of machine learning that derives knowledge from unlabelled, unclassified and uncategorized data. The scientists were able to develop their AI system with this technique thanks to Transformers, VentureBeat reports.

Transformers are a type of neural architecture that was introduced in 2017 in a paper by scientists from Google Brain. Transformers, like all other deep neural networks, contain neurons, which are mathematical functions loosely modelled after biological neurons. These neurons are arranged in interconnected layers that pass on signals from the input data. During training, the strength of each connection (its weight) is gradually adjusted.

What sets Transformers apart is that every output element is connected to every input element, and the weights between the two are calculated dynamically.
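
The dynamically calculated weights can be illustrated with a small example. The sketch below shows scaled dot-product attention, the mechanism commonly used in Transformers, in plain Python. It is a minimal illustration and not the code of the Microsoft system: the function names and the toy input values are assumptions, and a real Transformer would first apply learned projections to the input, which are left out here.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    dim = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # One score per input element: every output position sees every input.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys
        ]
        # The weights are computed from the data itself, not stored as parameters.
        weights = softmax(scores)
        # Each output is a weighted mix of all value vectors.
        mixed = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Toy input: three 2-dimensional vectors (hypothetical values).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(x, x, x)
```

Each row of `weights` holds one weight per input element and sums to 1, so changing the input changes how strongly each output element draws on each input element.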

Training and results

The Microsoft scientists built a Transformer component into their AI system that can handle speech or text as either input or output. For training data they used the publicly available LJSpeech dataset, which contains 13,100 English audio snippets and transcripts. From these, the team randomly chose two hundred clips to form the training set. They also used a denoising auto-encoder component to reconstruct corrupted speech and text.
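
The random selection of 200 clips out of 13,100 could be sketched as follows. This is only an illustration under stated assumptions: the article does not describe how the researchers drew their sample, and the function name `choose_clip_indices` and the seed value are hypothetical.

```python
import random

TOTAL_CLIPS = 13_100  # size of the LJSpeech dataset, as stated in the article
CHOSEN_CLIPS = 200    # number of clips picked for training

def choose_clip_indices(total, count, seed):
    """Pick `count` distinct clip indices out of `total`, reproducibly."""
    rng = random.Random(seed)  # a fixed seed makes the selection repeatable
    return sorted(rng.sample(range(total), count))

# Seed 0 is an arbitrary, hypothetical choice.
chosen = choose_clip_indices(TOTAL_CLIPS, CHOSEN_CLIPS, seed=0)
```

Sampling without replacement guarantees that no clip is selected twice.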

The results were quite good, especially given the small amount of training data. The researchers claim that the system outperformed three baseline algorithms in the tests. According to VentureBeat, several of the generated samples also sound human.

The scientists want to push the limits of unsupervised learning further by using only unpaired speech and text data, with the help of other pre-training methods. In their paper, they state that they have proposed an almost unsupervised method for text to speech and automatic speech recognition that uses a few paired speech and text samples along with some unpaired data.

This news article was automatically translated from Dutch to give Techzine.eu a head start. All news articles after September 1, 2019 are written in native English and NOT translated. All our background stories are written in native English as well. For more information read our launch article.
