
Mistral launches Voxtral: open-source speech recognition for businesses

Mistral is launching its new Voxtral speech models, designed to serve as an alternative to closed APIs offered by competitors. The open-source models feature advanced speech recognition, native multilingualism, and extensive context processing for production environments.

Until now, companies had to choose between open-source ASR systems with high error rates and expensive proprietary APIs. Mistral aims to bridge this gap with the new Voxtral models, which combine state-of-the-art accuracy with native semantic understanding for less than half the price of comparable solutions.

Advanced speech functionality

The company has released two variants: Voxtral Small, a 24B-parameter model for production environments, and Voxtral Mini, a 3B-parameter variant for local and edge deployments. Both are available under the Apache 2.0 license, which permits commercial use and modification.

The models go beyond transcription. A 32k-token context window lets them handle up to 30 minutes of audio for transcription or up to 40 minutes for understanding tasks. They also have built-in question-and-answer functionality and can generate structured summaries on the fly.
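For recordings longer than those limits, the audio has to be split before it is sent to the model. The following is a minimal sketch of that arithmetic, assuming the 30- and 40-minute figures quoted above; the `plan_chunks` helper and the `MAX_MINUTES` table are hypothetical names for illustration and are not part of any Mistral API.

```python
import math

# Per-request audio limits in minutes, as quoted in the article.
MAX_MINUTES = {"transcription": 30, "understanding": 40}


def plan_chunks(duration_min: float, task: str = "transcription") -> list[tuple[float, float]]:
    """Split a recording into equal (start, end) windows, in minutes,
    so that each window fits within the limit for the given task."""
    limit = MAX_MINUTES[task]
    # Number of windows needed; at least one, even for very short audio.
    n = max(1, math.ceil(duration_min / limit))
    size = duration_min / n
    return [(i * size, (i + 1) * size) for i in range(n)]


if __name__ == "__main__":
    # A 95-minute meeting recording, split for transcription.
    for start, end in plan_chunks(95):
        print(f"{start:6.2f} to {end:6.2f} min")
```

A production pipeline would more likely cut at pauses in speech than at fixed offsets; equal windows are used here only to keep the example short.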

According to Mistral, these capabilities make the Voxtral models well suited to real interactions and follow-up actions such as summaries, responses, analysis, and insights. For cost-sensitive use cases, Mistral points to Voxtral Mini Transcribe, a version optimized for transcription.

Multilingual performance

Voxtral detects the spoken language automatically and achieves state-of-the-art performance in eight widely used languages: English, Spanish, French, Portuguese, Hindi, German, Dutch, and Italian. This helps teams serve a global audience with a single system.

In benchmark tests, Voxtral Small consistently outperforms Whisper large-v3 and beats GPT-4o mini Transcribe and Gemini 2.5 Flash in all tasks. In the FLEURS evaluation, it outperforms Whisper in every task and achieves state-of-the-art results in multiple European languages.

The models can also perform function calls directly from speech. This allows spoken user intent to trigger backend functions, workflows, or API calls without intermediate processing steps.
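To illustrate what that could look like on the receiving side, here is a minimal sketch of a dispatcher that takes a function call emitted by a speech model and routes it to backend code. It assumes the call arrives as JSON with a `name` and an `arguments` field; that shape, the two handler functions, and the `dispatch` helper are all hypothetical and are not taken from Mistral's documentation.

```python
import json
from typing import Callable


# Hypothetical backend actions; names and parameters are illustrative only.
def create_ticket(subject: str, priority: str = "normal") -> str:
    return f"ticket created: {subject} ({priority})"


def schedule_callback(phone: str, time: str) -> str:
    return f"callback to {phone} at {time}"


# Registry of functions the speech model is allowed to trigger.
HANDLERS: dict[str, Callable[..., str]] = {
    "create_ticket": create_ticket,
    "schedule_callback": schedule_callback,
}


def dispatch(tool_call_json: str) -> str:
    """Route a model-emitted function call (assumed JSON shape) to its handler."""
    call = json.loads(tool_call_json)
    name = call["name"]
    if name not in HANDLERS:
        raise ValueError(f"unknown function: {name}")
    return HANDLERS[name](**call.get("arguments", {}))


if __name__ == "__main__":
    # What a model might emit after hearing "Open a high-priority ticket about my refund."
    spoken_intent = '{"name": "create_ticket", "arguments": {"subject": "Refund", "priority": "high"}}'
    print(dispatch(spoken_intent))
```

Keeping an explicit registry means the model can only trigger functions the application has chosen to expose, and any unknown name fails loudly instead of being silently ignored.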
