On Tuesday, French AI startup Mistral released Voxtral, its first family of open-source speech understanding models designed to bridge the gap between affordable but limited transcription tools and expensive proprietary systems.
Released under the Apache 2.0 license, Voxtral comes in two variants: a 24B-parameter model for production deployments and a 3B-parameter version optimized for local and edge use. Both models go well beyond simple transcription, offering native semantic understanding, multilingual support across nine languages, and the ability to handle audio up to 40 minutes long.
Key capabilities include built-in Q&A and summarization, function-calling directly from voice commands, and automatic language detection. The models also retain the text understanding abilities of their Mistral Small 3.1 backbone, so they can handle text tasks as well as audio.
Mistral claims Voxtral outperforms OpenAI's Whisper and competes with premium services like ElevenLabs Scribe and GPT-4o-mini, while costing "less than half the price" of comparable solutions. API pricing starts at just $0.001 per minute.
The launch positions Mistral as a major challenger in the voice AI space, offering developers production-ready speech intelligence without the constraints of closed systems. Users can access Voxtral through Hugging Face downloads, Mistral's API, or test it in Le Chat's voice mode, which is rolling out to all users over the coming weeks.