Voice-AI may not be a defined sub-genre of automation intelligence yet, but aiOla would like it to be; the company runs a voice-AI laboratory (no scalpels, just software) dedicated to advancing speech recognition technology. Its new release, Drax, is an open-source AI model that brings flow-matching-based generative methods to speech. But what does that mean and why does it matter?
Go with the flow-matching
As explained in this video, flow-matching-based generative methods are a class of models that learn a “continuous vector field” that transforms relatively simple “noise distributions” into more complex data distributions, following ordinary differential equations. Instead of learning “discrete denoising steps” (as diffusion models do), they train the flow to match probability paths directly between data and noise.
Put even more simply, it is like drawing a smooth path from noise to data.
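That “smooth path” can be sketched in a few lines of toy Python. Everything below is illustrative, not aiOla’s code: for a straight-line path between a noise sample and a known data point, the conditional velocity field has a closed form, and Euler integration of the ordinary differential equation carries the noise onto the data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: a "data" point and a Gaussian "noise" starting point.
x1 = np.array([2.0, -1.0])   # data sample
x0 = rng.normal(size=2)      # noise sample

def velocity(x, t, x1):
    """Conditional vector field for the straight-line probability path
    x_t = (1 - t) * x0 + t * x1: it always points from the current
    position toward the data, scaled by the time remaining."""
    return (x1 - x) / (1.0 - t)

# Euler integration of dx/dt = velocity(x, t) from t = 0 toward t = 1.
x = x0.copy()
steps = 100
for i in range(steps):
    t = i / steps
    x = x + velocity(x, t, x1) / steps

print(x)  # lands on x1: the noise has flowed onto the data point
```

In a real flow-matching model, a neural network is trained to predict this velocity from examples; sampling then just integrates the learned field, which is what makes generation fast.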
By rethinking how speech models are trained, Drax claims to capture the nuance of real-world audio without the delay that slows traditional systems. The company says that Drax delivers accuracy on par with, or better than, leading models like OpenAI’s Whisper while achieving five times lower latency than other major speech systems, such as Alibaba’s Qwen2.
“Modern speech recognition systems struggle to balance speed and accuracy. Models like OpenAI’s Whisper process speech token by token – producing strong results but taking too long for large-scale or long-form audio, such as hour-long meetings or customer calls. To improve speed, some researchers have shifted toward diffusion-based models that process multiple tokens simultaneously,” said Yossi Keshet, chief scientist at aiOla.
Keshet notes that diffusion-based systems are faster but often less accurate, since they are typically trained on clean, idealised data rather than the noisy, unpredictable speech found in real-world settings.
Transcription tool trade-off
As a result, transcription tools continue to fall short in enterprise environments like call centres, healthcare and manufacturing, where conversations are lengthy and even minor errors can affect compliance, analytics and customer experience.
Drax introduces a new way to train speech recognition systems that claims to achieve both speed and accuracy. While most ASR models, like OpenAI’s Whisper, process speech sequentially, predicting one token at a time, Drax outputs the entire token sequence in parallel, capturing full context at once. This parallel, flow-based approach dramatically reduces latency and prevents the compounding errors that often occur in long transcriptions.
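The latency gap can be illustrated with a deliberately simplified cost model. The function and numbers below are hypothetical, assuming a fixed cost per network forward pass; real decoders differ in many ways, but the scaling argument is the same: a sequential decoder pays one pass per token, while a parallel flow-based decoder pays only a handful of refinement passes for the whole sequence.

```python
def network_call(_inputs):
    """One unit of latency per forward pass (hypothetical cost model)."""
    return 1

T = 500  # tokens in a long transcript chunk (illustrative)

# Sequential, Whisper-style decoding: one forward pass per token,
# and each pass must wait for the previous token.
sequential_cost = sum(network_call(t) for t in range(T))

# Parallel, flow-based decoding as Drax is described: a few refinement
# passes, each emitting the entire token sequence at once.
refinement_steps = 5
parallel_cost = refinement_steps * network_call(None)

print(sequential_cost, parallel_cost)  # 500 vs 5
```

Under this toy model the cost ratio grows linearly with transcript length, which is why the difference matters most for hour-long meetings and calls.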
“Gone are the days of choosing between transcription accuracy or speed,” said Keshet. “With Drax, we’ve achieved a real breakthrough in speech recognition technology that delivers both technical innovation and immediate real-world impact. By open sourcing it, we hope to spark further discovery and collaboration from the community.”
Like image diffusion models that refine a picture from random noise, Drax learns to reconstruct speech from an initial noisy representation. During training, it moves through a three-step probability path: starting with meaningless noise, transitioning through a speech-like but imperfect middle state and finally converging on the correct transcript. This middle stage exposes Drax to realistic, acoustically plausible errors, helping it stay robust to accents, background noise and natural speech variability.
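The three-stage path described above can be mimicked in a small numerical sketch. The shapes and the corruption term here are hypothetical stand-ins, not Drax’s actual training recipe: the point is only that a path can start at pure noise, end at the clean target, and pass through a deliberately imperfect middle state.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy embeddings standing in for transcript token representations.
target = rng.normal(size=(5, 8))   # "correct transcript"
noise = rng.normal(size=(5, 8))    # starting noise

def sample_path_point(t):
    """A point on the noise-to-transcript path at time t in [0, 1].

    Near t=0 this is pure noise; at t=1 it is the clean target; in
    between, an extra perturbation whose weight t*(1-t) peaks mid-path
    plays the role of the 'speech-like but imperfect' middle state."""
    base = (1.0 - t) * noise + t * target
    mid_corruption = t * (1.0 - t) * rng.normal(size=target.shape)
    return base + mid_corruption

for t in (0.0, 0.5, 1.0):
    x_t = sample_path_point(t)
    print(f"t={t:.1f}  distance to transcript: "
          f"{np.linalg.norm(x_t - target):.2f}")
```

Training on points sampled along such a path is what exposes the model to plausible intermediate errors, rather than only clean inputs.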
“There’s no room for error in critical applications of speech technology,” said Gil Hetz, VP of AI at aiOla. “That’s why Drax is such a breakthrough; it combines accuracy and speed without compromise. It handles real-world speech, no matter the background noise, accent, or jargon, making it not only technologically impressive but also the kind of reliability modern enterprises need.”
Antiquated Alibaba now annihilated?
Drax achieved an average word error rate (WER) of 7.4%, on par with Whisper-large-v3 at 7.6%, and outperformed both OpenAI’s Whisper and Alibaba’s Qwen2-Audio on select datasets, while running up to five times faster.
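For readers unfamiliar with the metric, WER is the standard word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. A minimal reference implementation (not tied to any of the models above):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions)
    divided by the number of words in the reference."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution/match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words ≈ 0.167
```

A 7.4% WER thus means roughly one word-level error in every thirteen or fourteen reference words.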
“Voice is the most natural and efficient way for data entry and it will become the default way for human-machine interaction,” said Amir Haramty, President of aiOla. “Today, transcription often fails to keep up with the pace of real-world operations, from customer service to compliance. With Drax, we’re closing that gap and making voice technology actually practical at scale. That’s why advancing speech recognition is so important – it’s the future of the enterprise and I’m incredibly proud to be part of aiOla, where we’re continuously pushing the boundaries of what’s possible.”
Spanish, French, German & Mandarin
aiOla has published research detailing how this single-step training approach enables Drax to generate accurate transcriptions in just a few quick processing steps, dramatically reducing latency without sacrificing precision. Beyond English benchmarks, Drax maintained comparable or better accuracy across Spanish, French, German and Mandarin, demonstrating consistent performance across languages and environments.