Meta unveils Voicebox, a "breakthrough" generative AI for speech

Researchers claim it is the first AI model that can generalize to speech-generation tasks it was not specifically trained for.

Meta’s AI labs have introduced a first-of-its-kind artificial intelligence model. It is designed to do for speech what other AI services can do with text and images. According to the researchers, the new “Voicebox” generative AI model can generalize to speech-generation tasks it was not specifically trained to accomplish, and do so with “state-of-the-art performance”.

Synthesizing speech in six languages

Meta AI detailed the new AI service in a blog post. The researchers explain that, like generative systems for images and text, Voicebox “creates outputs in a vast variety of styles”. It can create outputs from scratch as well as modify a sample it’s given.

The difference, of course, is that Voicebox produces high-quality audio clips instead of creating a picture or a body of text. The model can synthesize speech across six languages, the researchers claim. It can also perform noise removal, content editing, style conversion, and “diverse sample generation”.

Meta trained Voicebox on more than 50,000 hours of recorded speech and transcripts from public domain audiobooks. The dataset covers English, French, Spanish, German, Polish, and Portuguese. The AI is trained to predict a speech segment when given the surrounding speech and the transcript of the segment. It then applies this ability across speech-generation tasks.
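To make the training setup described above more concrete, here is a minimal sketch of how a single "fill in the missing segment" training example could be put together. It is an illustration only, not Meta's code: the function name `make_infill_example`, the frame representation, and the use of `None` as a mask marker are all assumptions made for this example.

```python
from typing import List, Optional, Tuple

# Hypothetical representation: one utterance is a list of audio feature frames.
Frame = List[float]


def make_infill_example(
    frames: List[Frame], start: int, end: int
) -> Tuple[List[Optional[Frame]], List[Frame]]:
    """Hide frames[start:end] and return (context, target).

    context: the utterance with the hidden frames replaced by None
    target:  the hidden frames a model would be asked to predict
    """
    if not 0 <= start < end <= len(frames):
        raise ValueError("mask span must lie inside the utterance")
    context = [None if start <= i < end else f for i, f in enumerate(frames)]
    target = frames[start:end]
    return context, target


# Toy utterance with five one-dimensional frames (made-up values).
frames = [[0.1], [0.2], [0.3], [0.4], [0.5]]
context, target = make_infill_example(frames, 1, 3)
```

In a real system the model would also receive the transcript of the utterance alongside `context`, and would learn to reconstruct `target` from both.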

What makes Voicebox so special?

The main technological breakthrough behind the new AI model is that it does not need task-specific training. Prior to Voicebox, generative AI for speech required specific training for each task using carefully prepared training data. Voicebox can learn “just from raw audio and an accompanying transcription”, the researchers say.

To make the AI output sound more “human”, Meta built Voicebox on a method called Flow Matching (FM). This helps Voicebox outperform Microsoft’s VALL-E in terms of intelligibility and audio similarity, the researchers claim.

A potential for misuse

“As with other powerful new AI innovations, we recognize that this technology brings the potential for misuse and unintended harm”, the researchers write. “In our paper, we detail how we built a highly effective classifier that can distinguish between authentic speech and audio generated with Voicebox to mitigate these possible future risks”, they add.
