Researchers claim it is the first AI model that can handle speech-generation tasks it was never specifically trained for.

Meta’s AI labs have introduced a first-of-its-kind artificial intelligence model. It is designed to do for speech what other AI services can do with text and images. According to the researchers, the new “Voicebox” generative AI model can generalize to speech-generation tasks it was not specifically trained to accomplish, and do so with “state-of-the-art performance”.

Synthesizing speech in six languages

Meta AI detailed the new model in a blog post. The researchers explain that, like generative systems for images and text, Voicebox “creates outputs in a vast variety of styles”. It can create outputs from scratch as well as modify a sample it’s given.

The difference, of course, is that Voicebox produces high-quality audio clips instead of a picture or a body of text. The model can synthesize speech across six languages, the researchers claim. It can also perform noise removal, content editing, style conversion, and “diverse sample generation”.

Meta trained Voicebox on more than 50,000 hours of recorded speech and transcripts from public domain audiobooks. The dataset covers English, French, Spanish, German, Polish, and Portuguese. The AI is trained to predict a speech segment when given the surrounding speech and the transcript of the segment. It then applies this ability across speech-generation tasks.
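The training setup described above, predicting a hidden speech segment from its surroundings and the transcript, can be illustrated with a minimal sketch. This is not Meta's code: the function name `make_infilling_example`, the use of plain numbers as stand-ins for audio frames, and the `None` placeholder for hidden frames are all assumptions made purely for illustration.

```python
def make_infilling_example(frames, transcript, mask_start, mask_len):
    """Build one infilling example: hide a span of frames, keep the rest as context.

    Hypothetical helper for illustration only; real systems work on
    audio features (e.g. spectrogram frames), not single floats.
    """
    mask_end = mask_start + mask_len
    if mask_len <= 0 or mask_start < 0 or mask_end > len(frames):
        raise ValueError("mask span must lie inside the frame sequence")

    # True marks a frame the model must predict.
    mask = [mask_start <= i < mask_end for i in range(len(frames))]
    # The model sees surrounding speech only; hidden frames become None.
    context = [None if hidden else frame for frame, hidden in zip(frames, mask)]
    # The training target is the hidden segment itself.
    target = frames[mask_start:mask_end]

    return {
        "context": context,
        "mask": mask,
        "transcript": transcript,
        "target": target,
    }


frames = [0.1, 0.4, 0.3, 0.9, 0.7, 0.2]
example = make_infilling_example(frames, "hello world", mask_start=2, mask_len=3)
print(example["context"])
print(example["target"])
```

A model trained on examples like this would receive `context` and `transcript` as input and be scored on how closely its output matches `target`. Tasks such as noise removal or content editing can then be framed as filling in a chosen span.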

What makes Voicebox so special?

The main technical advance behind the new AI model is its ability to learn speech synthesis without task-specific training. Prior to Voicebox, generative AI for speech required specific training for each task using carefully prepared training data. Voicebox can learn “just from raw audio and an accompanying transcription”, the researchers say.

To make the AI output sound more “human”, Meta built Voicebox on a method called Flow Matching (FM). This helps Voicebox outperform Microsoft’s VALL-E in terms of intelligibility and audio similarity, they claim.
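The article does not spell out how Flow Matching works. As a rough sketch, assuming the commonly used linear-path form of conditional flow matching, a model is trained to predict the velocity that carries a noise sample `x0` toward a data sample `x1` along a straight line. The function names and the toy two-value vectors below are illustrative assumptions, not Voicebox's actual implementation.

```python
def cfm_training_pair(x0, x1, t, sigma_min=1e-5):
    """Return (x_t, target_velocity) for a linear conditional flow matching path.

    x0: noise sample, x1: data sample, t: time in [0, 1].
    Assumed path: x_t = (1 - (1 - sigma_min) * t) * x0 + t * x1
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must be in [0, 1]")
    if len(x0) != len(x1):
        raise ValueError("x0 and x1 must have the same length")

    scale = 1.0 - (1.0 - sigma_min) * t
    # Point on the path between noise and data at time t.
    x_t = [scale * a + t * b for a, b in zip(x0, x1)]
    # Time derivative of the path: the velocity the model should predict.
    velocity = [b - (1.0 - sigma_min) * a for a, b in zip(x0, x1)]
    return x_t, velocity


def mean_squared_error(predicted, target):
    """Regression loss between a predicted and a target velocity."""
    return sum((p - q) ** 2 for p, q in zip(predicted, target)) / len(target)


noise = [1.0, -1.0]
data = [3.0, 2.0]
x_t, velocity = cfm_training_pair(noise, data, t=0.5, sigma_min=0.0)
print(x_t, velocity)
```

During training, a network would take `x_t` and `t` (plus any conditioning such as the transcript) and be penalized with `mean_squared_error` against `velocity`. At generation time, following the learned velocities from noise produces a sample.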

A potential for misuse

“As with other powerful new AI innovations, we recognize that this technology brings the potential for misuse and unintended harm”, the researchers write. “In our paper, we detail how we built a highly effective classifier that can distinguish between authentic speech and audio generated with Voicebox to mitigate these possible future risks”, they add.