Gemini 3.1 Flash TTS is our most controllable text-to-speech model yet. With new Audio Tags, you can easily direct vocal style, delivery, and pace through text commands. 🧵
Google Launches Gemini 3.1 Flash TTS With Natural Language Audio Tags
Google launched Gemini 3.1 Flash TTS, a specialized text-to-speech model for expressive audio generation. The model introduces audio tags: natural language commands embedded in the text that steer vocal style and pacing. It supports over 70 languages and includes SynthID watermarking (an imperceptible digital signal) to identify AI-generated content.
Traditional text-to-speech often requires complex markup or produces robotic results. By moving control to natural language, this model lowers the barrier to creating immersive voice experiences. It currently leads the Artificial Analysis TTS leaderboard for its balance of high human preference scores and low inference costs (inference is the running of a trained model).
Access the model in preview through the Gemini API and Google AI Studio, which features a "Director's Chair" interface for scene direction. Enterprise users can find it on Vertex AI, while Workspace users will see it rolling out in Google Vids for automated video voiceovers.
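To make the idea of audio tags concrete, here is a minimal sketch of how a script with inline delivery directions might be assembled before it is sent to the model. The square-bracket syntax, the tag words, and the `Line` and `render_tagged_script` names are all assumptions made for illustration; the announcement does not specify the exact tag format.

```python
from dataclasses import dataclass


@dataclass
class Line:
    """One spoken line plus optional delivery directions (hypothetical structure)."""
    text: str
    tags: tuple[str, ...] = ()


def render_tagged_script(lines: list[Line]) -> str:
    """Prefix each line with its directions in square brackets.

    The bracket syntax is an assumption for illustration, not confirmed tag syntax.
    """
    rendered = []
    for line in lines:
        # Skip empty tags so a stray "" does not produce "[] ".
        prefix = "".join(f"[{tag.strip()}] " for tag in line.tags if tag.strip())
        rendered.append(f"{prefix}{line.text.strip()}")
    return "\n".join(rendered)


script = render_tagged_script([
    Line("Welcome back to the show.", ("cheerful",)),
    Line("Tonight's story starts in the dark.", ("whispering", "slow")),
    Line("Let's begin."),
])
print(script)
```

The resulting string would then be passed as the text input of a speech request. The request itself is left out here because the source does not describe the model identifier or call shape.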
Google DeepMind
@GoogleDeepMind
111 retweets · 1k likes