HeadsUpAI

Mistral AI Launches Voxtral TTS to Challenge Proprietary Models with Open Weights


Mistral AI has launched Voxtral TTS, a 4B-parameter text-to-speech model designed for natural, expressive speech. Built on the Ministral 3B backbone, the architecture combines a 3.4B decoder with a 390M flow-matching acoustic transformer and a 300M neural audio codec. It supports nine languages and captures nuances such as rhythm, intonation, and regional dialects.
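As a quick sanity check on the "4B" label, the three component sizes quoted above can be added up. This is only a sketch using the rounded figures from this article; the function name is illustrative, and the exact parameter counts may differ.

```python
# Component sizes as quoted in the article, in billions of parameters.
# These are rounded figures, not exact counts.
COMPONENTS_B = {
    "decoder": 3.4,
    "flow-matching acoustic transformer": 0.39,
    "neural audio codec": 0.30,
}


def total_params_b(components: dict[str, float]) -> float:
    """Sum component sizes (in billions of parameters)."""
    return sum(components.values())


total = total_params_b(COMPONENTS_B)
print(f"Total: about {total:.2f}B parameters")
```

The quoted components sum to roughly 4.09B, which is consistent with the headline 4B figure.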

The model targets the latency-quality trade-off that is critical for voice agents, with a reported model latency of 70 ms. According to Mistral, it matches the quality of proprietary leaders like ElevenLabs while enabling zero-shot cross-lingual adaptation, which preserves a speaker's voice and accent when generating speech in a different language.

You can access Voxtral TTS via API at $0.016 per 1,000 characters or test it in Mistral Studio. An open-weight version is available on Hugging Face under a CC BY-NC 4.0 license. Together, these options enable cost-effective, localized voice workflows for customer support and real-time translation.
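To get a feel for what the quoted API price means in practice, here is a minimal cost estimator. It assumes billing is strictly linear in character count, with no minimums, rounding rules, or volume tiers; the function name and sample text are illustrative and not part of any Mistral SDK.

```python
from decimal import Decimal

# Price quoted in the article: $0.016 per 1,000 characters.
PRICE_PER_1K_CHARS_USD = Decimal("0.016")


def estimate_tts_cost(text: str) -> Decimal:
    """Estimate the USD cost of synthesizing `text`.

    Assumption: cost scales linearly with len(text); actual billing
    rules (minimum charges, rounding) are not covered here.
    """
    return PRICE_PER_1K_CHARS_USD * len(text) / 1000


# Hypothetical support-call greeting repeated to simulate a longer script.
script = "Hello, thanks for calling." * 100
print(f"{len(script)} characters -> ${estimate_tts_cost(script):.4f}")
```

At this rate, one million characters would come to about $16 under the same linear assumption.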

Mistral AI
@MistralAI
X

🔊 Introducing Voxtral TTS: our new frontier open-weight model for natural, expressive, and ultra-fast text-to-speech
🎭 Realistic, emotionally expressive speech.
🌍 Supports 9 languages and accurately captures diverse dialects.
⚡ Very low latency for time-to-first-audio.
🔄 Easily adaptable to new voices

516 retweets