Google Releases Prompting Formula for Granular Control of Gemini 3.1 TTS

Traditional text-to-speech often relies on global settings that apply a single emotion to an entire text block. Google's new prompting syntax for Gemini 3.1 TTS instead enables precise word-level transitions, allowing a single output to shift from a whisper to a cackle mid-sentence. That level of control reflects the industry's shift toward production-grade voice agents.
You direct the performance by inserting tags such as [slow] or [short pause] exactly where a transition should occur. The model supports vocalizations like [laughs] and stylistic cues like [scholarly], provided tags are not placed directly next to each other. The feature is available through the Gemini API for applications such as language learning and customer service.
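As a rough sketch, a tagged prompt might be sent through the google-genai Python SDK along these lines. The model name, voice name, and config fields below are assumptions modeled on Google's existing TTS preview models, not confirmed details of Gemini 3.1:

```python
# Sketch: sending an audio-tagged prompt to a Gemini TTS model.
# The model name "gemini-3.1-tts" and voice "Kore" are assumptions
# for illustration; substitute the identifiers the API actually exposes.

# Inline audio tags go exactly where the vocal transition should occur.
PROMPT = (
    "[whispers] Something moved in the dark. [short pause] "
    "[mysterious] No one knew what it was. [laughs] Until now."
)

def synthesize(prompt: str) -> bytes:
    """Request audio for a tagged prompt (requires an API key)."""
    from google import genai
    from google.genai import types

    client = genai.Client()  # reads GEMINI_API_KEY from the environment
    response = client.models.generate_content(
        model="gemini-3.1-tts",  # assumed model identifier
        contents=prompt,
        config=types.GenerateContentConfig(
            response_modalities=["AUDIO"],
            speech_config=types.SpeechConfig(
                voice_config=types.VoiceConfig(
                    prebuilt_voice_config=types.PrebuiltVoiceConfig(
                        voice_name="Kore"  # assumed voice name
                    )
                )
            ),
        ),
    )
    # Raw audio bytes come back as inline data on the first part.
    return response.candidates[0].content.parts[0].inline_data.data
```

Note that each tag sits immediately before the text it modifies, with ordinary prose between consecutive tags, per Google's formatting guidance.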
Frequently asked questions
- What are Gemini 3.1 TTS audio tags?
- Audio tags are a prompting feature for Google's Gemini 3.1 text-to-speech model that allows users to direct vocal style, pace, and delivery. By embedding natural language commands in square brackets directly into the text, you can guide the AI to perform specific vocalizations or change its tone at exact moments during the speech generation.
- How do you format audio tags for Gemini 3.1 TTS?
- To use audio tags effectively, you must enclose all inline commands in square brackets, such as [whispers] or [fast]. These tags should be placed exactly where you want the vocal transition to occur. Google recommends avoiding placing tags directly next to each other to ensure the model processes the transitions smoothly and maintains a natural delivery.
- What specific vocal styles can I control with Gemini 3.1 TTS?
- Gemini 3.1 TTS offers granular control over several aspects of speech. You can manage pacing with tags like [slow] or [fast], and control timing using [short pause] or [long pause]. The model also supports specific vocalizations such as [cackles], [laughs], or [whispers], as well as broader stylistic tones like [encouraging], [scholarly], [mysterious], and [friendly].
- Who can use Gemini 3.1 TTS and its audio tags?
- Gemini 3.1 TTS is available for developers and businesses building audio-first applications through the Gemini API. It is designed for use cases ranging from language learning tools and interactive podcast apps to adaptive customer service offerings. The model provides an intuitive way to guide vocal style and delivery, allowing for more natural and expressive speech generation in production environments.
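The formatting rule above, tags in square brackets that are never placed directly next to each other, can be checked mechanically before a prompt is sent. A minimal sketch (the function name and regex are my own, not part of the Gemini API):

```python
import re

# Matches an inline audio tag such as [whispers] or [short pause].
TAG = re.compile(r"\[[^\[\]]+\]")

def adjacent_tags(prompt: str) -> list[tuple[str, str]]:
    """Return pairs of audio tags separated only by whitespace,
    the placement Google recommends avoiding so transitions stay smooth."""
    tags = list(TAG.finditer(prompt))
    pairs = []
    for a, b in zip(tags, tags[1:]):
        between = prompt[a.end():b.start()]
        if between.strip() == "":
            pairs.append((a.group(), b.group()))
    return pairs

# Usage:
# adjacent_tags("[slow] [whispers] hello")        -> [("[slow]", "[whispers]")]
# adjacent_tags("[slow] hello [whispers] there")  -> []
```

A check like this is cheap to run on every prompt and catches the one placement mistake the documentation explicitly warns against.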

