Google Launches Gemini 3.1 Flash TTS With Natural Language Audio Tags

Google DeepMind

Apr 15, 2026 · Updated Apr 30, 2026

Google released Gemini 3.1 Flash TTS, a text-to-speech model that uses natural language audio tags to control vocal style, pace, and delivery. The update lets users direct AI speech the way they would direct a human performance, while keeping costs low and generation fast.

Google launched Gemini 3.1 Flash TTS, a specialized text-to-speech model for expressive audio generation. The model introduces audio tags: natural language commands embedded in the text to steer vocal style and pacing. It supports over 70 languages and includes SynthID watermarking (an imperceptible digital signal) to identify AI-generated content.
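
To make the idea of embedded audio tags concrete, here is a minimal Python sketch that assembles a script with inline directions. The square-bracket tag form, the `Line` type, and the `render_script` helper are all assumptions for illustration. They are not the model's documented syntax or part of the Gemini API, since the announcement does not specify how tags are written.

```python
from dataclasses import dataclass


@dataclass
class Line:
    """One spoken line plus an optional natural-language direction."""
    text: str
    direction: str = ""  # free-form wording, e.g. "whispering, slow pace"


def render_script(lines: list[Line]) -> str:
    """Embed each direction inline, ahead of the text it applies to.

    ASSUMPTION: the "[direction] text" form is illustrative only; check the
    Gemini API documentation for the tag syntax the model actually expects.
    """
    rendered = []
    for line in lines:
        direction = line.direction.strip()
        prefix = f"[{direction}] " if direction else ""
        rendered.append(prefix + line.text.strip())
    return "\n".join(rendered)


script = render_script([
    Line("Welcome back to the show.", "warm, upbeat"),
    Line("Tonight's story begins in the dark.", "whispering, slow pace"),
    Line("Let's get started."),  # no direction: passed through unchanged
])
print(script)
```

The resulting string is the kind of text input that would be sent to the model. The API call itself is omitted here because the announcement does not give the model identifier or request format.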

Traditional text-to-speech often requires complex markup or produces robotic results. By moving control to natural language, this model lowers the barrier for creating immersive voice experiences. It currently leads the Artificial Analysis TTS leaderboard for its balance of high human preference scores and low inference costs (the cost of running a trained model).

Access the model in preview through the Gemini API and Google AI Studio, which features a "Director's Chair" interface for scene direction. Enterprise users can find it on Vertex AI, while Workspace users will see it rolling out in Google Vids for automated video voiceovers.

View the full update on blog.google

Google DeepMind

@GoogleDeepMind · Apr 15

Gemini 3.1 Flash TTS is our most controllable text-to-speech model yet. With new Audio Tags, you can easily direct vocal style, delivery, and pace through text commands. 🧵

Keep reading

Google Gemini 3.1 Flash TTS Becomes Flagship for Expressive Speech

Google designated Gemini 3.1 Flash TTS as its most expressive speech generation model to date. The model uses natural language audio tags to allow developers to direct emotional delivery and vocal character within generated audio.

Google Launches Gemini 3.1 Flash Live for Natural Real Time Voice Agents

Google DeepMind · Mar 28

Google DeepMind released Gemini 3.1 Flash Live, a low-latency audio model optimized for real-time dialogue and complex task execution. The model improves function calling and tonal recognition, allowing voice agents to handle multi-step workflows and emotional nuances more reliably. This enables more fluid interactions in noisy environments without losing conversational context.

Google Gemini 3.1 Flash Live Claims Top Spot for Production Voice Agents

Google AI Studio · Apr 13

Google's Gemini 3.1 Flash Live model reached the #1 position on the Tau Voice Bench leaderboard for real-time voice agents. The update delivers significantly lower latency and higher precision, signaling that multimodal voice AI is now reliable enough for production-grade applications.

Alibaba Fun-Realtime-TTS claims top spot on Speech Arena leaderboard

Artificial Analysis · Jun 4

Alibaba's latest text-to-speech model has reached #1 on the Artificial Analysis Speech Arena, surpassing Google's Gemini. The model delivers high-fidelity real-time audio with native support for regional accents and voice cloning at a competitive price point.