Mistral AI Launches Voxtral TTS to Challenge Proprietary Models with Open Weights

Mistral AI

Mar 28, 2026 · Updated Apr 25, 2026

Mistral AI launched Voxtral TTS, a 4B-parameter text-to-speech model capable of zero-shot voice cloning from just three seconds of audio. By offering frontier-grade emotional expressiveness and low latency in an open-weight format, it provides a high-performance alternative to closed-source providers for building real-time voice agents.

Voxtral TTS is designed for natural, expressive speech. Built on the Ministral 3B backbone, the architecture combines a 3.4B-parameter decoder with a 390M-parameter flow-matching acoustic transformer and a 300M-parameter neural audio codec. It supports nine languages and captures nuances such as rhythm, intonation, and regional dialects.
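As a quick sanity check on the figures above, the three component sizes can be summed and compared with the 4B headline number. The short Python sketch below uses only the parameter counts quoted in this article; the dictionary keys are illustrative labels, not official component names.

```python
# Parameter counts as quoted in the article (rounded figures).
# The keys are illustrative labels, not official component names.
COMPONENT_PARAMS = {
    "decoder": 3_400_000_000,
    "flow_matching_acoustic_transformer": 390_000_000,
    "neural_audio_codec": 300_000_000,
}


def total_params(components: dict[str, int]) -> int:
    """Sum the parameter counts of all listed components."""
    return sum(components.values())


total = total_params(COMPONENT_PARAMS)
print(f"Total: {total / 1e9:.2f}B parameters")
```

The quoted components add up to roughly 4.09B parameters, which is consistent with the rounded 4B figure.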

The model targets the trade-off between latency and quality that is critical for voice agents, with a model latency of 70 ms. It matches the quality of proprietary leaders like ElevenLabs while enabling zero-shot cross-lingual adaptation, which preserves a speaker's voice and accent when generating speech in a different language.

You can access Voxtral TTS via API at $0.016 per 1,000 characters or test it in Mistral Studio. An open-weight version is available on Hugging Face under a CC BY-NC 4.0 license. Together, these options support cost-effective, localized voice workflows such as customer support and real-time translation.
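To put the API price in context, the sketch below estimates the cost of synthesizing a given number of characters. It assumes billing scales linearly with character count, with no minimum charge, rounding, or volume discount; the article does not confirm those details, and the function name is hypothetical.

```python
from decimal import Decimal

# Price quoted in the article: $0.016 per 1,000 characters.
PRICE_PER_1K_CHARS_USD = Decimal("0.016")


def estimate_tts_cost(num_chars: int,
                      price_per_1k: Decimal = PRICE_PER_1K_CHARS_USD) -> Decimal:
    """Estimate the API cost in USD for num_chars characters.

    Assumption: cost is strictly proportional to character count
    (no minimums, rounding rules, or discounts).
    """
    if num_chars < 0:
        raise ValueError("num_chars must be non-negative")
    return Decimal(num_chars) / Decimal(1000) * price_per_1k


# Example: a support script of 250,000 characters.
print(f"${estimate_tts_cost(250_000):.2f}")
```

Under these assumptions, one million characters would cost about $16.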

View the full update on mistral.ai

Mistral AI

@MistralAI · Mar 26

🔊Introducing Voxtral TTS: our new frontier open-weight model for natural, expressive, and ultra-fast text-to-speech
🎭Realistic, emotionally expressive speech.
🌍Supports 9 languages and accurately captures diverse dialects.
⚡Very low latency for time-to-first-audio.
🔄Easily adaptable to new voices

