Details
- Mistral AI introduced Voxtral Transcribe 2, a next-generation family of speech-to-text models featuring state-of-the-art transcription, speaker diarization, and sub-200ms real-time latency for voice agents and live applications.
- Voxtral Realtime features a natively streaming architecture with configurable latency under 200ms, staying within 1-2% WER of the offline model at the 480ms setting; it is released as open weights under Apache 2.0 (see the realtime sketch after this list).
- Voxtral Mini Transcribe 2 provides batch transcription at 4% WER on FLEURS across 13 languages, with speaker diarization, context biasing, and word-level timestamps, priced at $0.003/min via the API (see the batch sketch after this list).
- The models come in two sizes: Voxtral Small (24B parameters) for production-scale deployments with advanced chat, function calling, and a 32k-token context (30-40 minutes of audio), and Voxtral Mini (3B) for edge deployment; both offer multilingual support and built-in Q&A/summarization.
- A new audio playground in Mistral Studio allows instant experimentation. API pricing is $0.003/min for Mini Transcribe 2 and $0.006/min for Realtime; on benchmarks the models outperform Whisper large-v3 and match GPT-4o mini Transcribe.
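
The realtime path would normally be consumed over a persistent connection: the client pushes small audio chunks and receives partial transcripts as they stabilize. A minimal sketch of that loop, assuming a hypothetical WebSocket endpoint and JSON event schema (the URL, field names, and end sentinel below are illustrative assumptions, not from Mistral's documentation):

```python
# Hedged sketch: streaming audio to a hypothetical realtime transcription socket.
# The URL, auth handling, and message schema are illustrative assumptions.
import asyncio
import json
import os

import websockets  # pip install websockets

API_KEY = os.environ["MISTRAL_API_KEY"]
URL = "wss://api.mistral.ai/v1/audio/transcriptions/realtime"  # assumed path
CHUNK_MS = 80  # small frames keep end-to-end latency near the configured floor

async def stream(pcm_chunks):
    # `additional_headers` is named `extra_headers` on older websockets releases.
    async with websockets.connect(
        URL, additional_headers={"Authorization": f"Bearer {API_KEY}"}
    ) as ws:
        async def send_audio():
            for chunk in pcm_chunks:                    # raw 16 kHz mono PCM
                await ws.send(chunk)                    # binary audio frame
                await asyncio.sleep(CHUNK_MS / 1000)    # pace like live capture
            await ws.send(json.dumps({"type": "end"}))  # assumed end sentinel

        async def recv_transcripts():
            async for message in ws:                    # ends when server closes
                event = json.loads(message)
                if event.get("type") == "transcript":   # assumed event type
                    print(event["text"], flush=True)

        await asyncio.gather(send_audio(), recv_transcripts())
```

The 80ms chunking is arbitrary; in practice the frame size would be tuned against whatever latency setting is chosen server-side.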
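Batch transcription, by contrast, is a single request-response round trip. A minimal sketch, assuming an OpenAI-style multipart endpoint and placeholder model id and form fields (`voxtral-mini-transcribe-2`, `diarize`, and `timestamp_granularities` should be verified against Mistral's API reference):

```python
# Hedged sketch: batch transcription with diarization and word-level timestamps.
# Endpoint path, model id, and form fields follow the common OpenAI-style
# /audio/transcriptions convention and are assumptions, not confirmed API.
import os
import requests

API_KEY = os.environ["MISTRAL_API_KEY"]

with open("meeting.wav", "rb") as audio:
    resp = requests.post(
        "https://api.mistral.ai/v1/audio/transcriptions",  # assumed path
        headers={"Authorization": f"Bearer {API_KEY}"},
        files={"file": audio},
        data={
            "model": "voxtral-mini-transcribe-2",  # placeholder model id
            "diarize": "true",                     # placeholder flag
            "timestamp_granularities": "word",     # placeholder flag
        },
        timeout=120,
    )
resp.raise_for_status()
result = resp.json()
print(result.get("text", ""))
# Speaker labels and per-word timestamps, if enabled, would arrive in
# segment-level fields; the exact schema depends on the API version.
```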
Impact
Mistral AI's Voxtral Transcribe 2 advances open-weight speech models by delivering sub-200ms latency and stronger benchmark results than Whisper large-v3, positioning it competitively against closed APIs like OpenAI's GPT-4o mini Transcribe and ElevenLabs Scribe at less than half the cost ($0.003-$0.006 per minute, per the pricing above). This pricing and the Apache 2.0 release lower barriers for developers building voice agents, multilingual transcription, and multimodal apps, accelerating adoption in enterprise tools like meeting summaries and customer service bots. By integrating reasoning, function calling from voice, and long-context handling in a unified pipeline, Voxtral narrows the gap with proprietary systems, potentially shifting market dynamics toward open models amid rising demand for on-device and real-time inference. Over the next 12-24 months, it could redirect R&D toward hybrid audio-text intelligence, easing GPU constraints via the efficient 3B variant while pressuring rivals to match open-source accessibility and performance in global, low-latency applications.
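
To make the per-minute rates concrete, a back-of-the-envelope comparison at the prices listed above (the 10,000-hour monthly volume is an arbitrary illustration):

```python
# Back-of-the-envelope monthly cost at the listed per-minute rates.
BATCH_RATE = 0.003     # $/min, Voxtral Mini Transcribe 2
REALTIME_RATE = 0.006  # $/min, Voxtral Realtime

minutes = 10_000 * 60  # 10,000 hours of audio, an illustrative volume

print(f"Batch:    ${minutes * BATCH_RATE:,.0f}")     # $1,800
print(f"Realtime: ${minutes * REALTIME_RATE:,.0f}")  # $3,600
```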
