Google Releases Prompting Formula for Granular Control of Gemini 3.1 TTS

Google

Google released a formal prompting framework for Gemini 3.1 TTS that uses inline audio tags to control speech style and pacing. This update provides the specific syntax and constraints needed to direct AI voices like human actors, enabling dynamic and expressive vocal performances.

Google released a formal prompting guide for Gemini 3.1 TTS, a text-to-speech model for expressive audio. The guide gives a formula for using audio tags (natural language commands in square brackets) to direct vocal style. It follows the model's launch a week earlier, which introduced the audio tag capability.
Syntax: square brackets
Placement: inline, at the transition point
Pacing tags: slow, fast
Pause tags: short pause, long pause
Vocalization tags: whispers, cackles, laughs, screams
Style tags: encouraging, scholarly, mysterious, friendly
Availability: Gemini 3.1 TTS via the Gemini API

Traditional text-to-speech often relies on global settings that apply one emotion to an entire text block. Inline tags instead take effect at the point where they appear in the text, so a single output can shift from a whisper to a cackle. That makes the model more useful for voice applications that need to change delivery partway through a line.

You can direct performances by inserting tags like [slow] or [short pause] exactly where transitions occur. The model supports vocalizations like [laughs] and stylistic cues like [scholarly]. Google advises against placing tags directly next to each other. Google describes the model as its latest and best text-to-speech model and points to language learning and customer service as example use cases.
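A minimal Python sketch of that placement rule, assuming a prompt is assembled from (tag, text) pairs. The helper name build_tagged_prompt is hypothetical and not part of any Google library; it only produces the prompt string and does not call the Gemini API.

```python
def build_tagged_prompt(segments):
    """Join (tag, text) pairs into one prompt string.

    Each tag is wrapped in square brackets and placed immediately before
    the text it should affect. A tag of None means the text has no tag.
    This helper is illustrative only and is not part of any Google SDK.
    """
    parts = []
    for tag, text in segments:
        text = text.strip()
        if not text:
            # A tag with no text after it would end up next to the following tag.
            raise ValueError(f"tag {tag!r} has no text after it")
        parts.append(f"[{tag}] {text}" if tag else text)
    return " ".join(parts)


prompt = build_tagged_prompt([
    ("encouraging", "Let's try that last sentence again."),
    ("slow", '"L\'oiseau s\'est envolé."'),
    ("short pause", "Perfect!"),
    ("laughs", "You're a natural."),
])
print(prompt)
```

Because every tag must carry some text, a prompt built this way cannot contain two tags directly next to each other.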

Google AI (@GoogleAI) on X:

Last week, we launched Gemini 3.1 TTS, our latest and best text-to-speech model. This new model introduces [awe] audio tags, an intuitive way to guide vocal style, pace, and delivery. Here are some tips on the best ways to use audio tags in your prompts:

1. All inline tags must be enclosed in square brackets, such as [screams] or [whispers]
2. Insert these tags exactly where you want the transition to occur and make sure to avoid placing tags directly next to each other
3. Use tags like [slow] or [fast] to control the pace of the delivery, or even [short pause] or [long pause] to ramp up the anticipation in dramatic moments
4. The model also offers granular control over vocalizations, allowing you to direct the delivery with cues like [cackles] or [whispers]
5. An ideal audio tag formula could look something like: [encouraging] Let’s try that last sentence again to make sure that you nailed it. [slow] "L'oiseau s'est envolé." [short pause] Perfect! [laughs] You're a natural.

No matter what you’re developing, from [scholarly] a language learning tool, to [mysterious] an interactive podcast app, to [friendly] more adaptive customer service offerings, and beyond, these prompting tips will equip you to start building with Gemini 3.1 TTS.
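To see how the example formula in tip 5 breaks down, the sketch below splits a tagged prompt into (tag, text) pairs with a regular expression. The function name split_tagged_prompt is hypothetical, and the splitting logic is one reader's interpretation of the bracket syntax, not how the model itself parses prompts. Apostrophes are written as plain ASCII here.

```python
import re

# One or more non-bracket characters enclosed in square brackets.
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def split_tagged_prompt(prompt):
    """Return (tag, text) pairs; text before the first tag gets tag None."""
    segments = []
    current_tag = None
    position = 0
    for match in TAG_PATTERN.finditer(prompt):
        text = prompt[position:match.start()].strip()
        if text or current_tag is not None:
            segments.append((current_tag, text))
        current_tag = match.group(1)
        position = match.end()
    tail = prompt[position:].strip()
    if tail or current_tag is not None:
        segments.append((current_tag, tail))
    return segments


example = (
    "[encouraging] Let's try that last sentence again to make sure that "
    "you nailed it. [slow] \"L'oiseau s'est envolé.\" [short pause] Perfect! "
    "[laughs] You're a natural."
)
for tag, text in split_tagged_prompt(example):
    print(f"{tag!r:16} {text}")
```

Each pair shows which stretch of text a tag precedes, which makes it easy to check that every transition sits where it was intended.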

Still wondering? A few quick answers below.

Audio tags are a prompting feature for Google's Gemini 3.1 text-to-speech model that allows users to direct vocal style, pace, and delivery. By embedding natural language commands in square brackets directly into the text, you can guide the AI to perform specific vocalizations or change its tone at exact moments during the speech generation.

To use audio tags effectively, enclose every inline command in square brackets, such as [whispers] or [fast], and place each tag exactly where you want the vocal transition to occur. Google also advises against placing tags directly next to each other.
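These two rules can be checked before a prompt is sent. The sketch below is a hypothetical lint helper, not an official tool. It assumes "directly next to each other" means two tags separated by nothing but whitespace, which is an interpretation of Google's wording.

```python
import re

# A well-formed tag: non-empty, bracket-free text inside square brackets.
TAG = r"\[[^\[\]]+\]"
# Assumption: "adjacent" means two tags with only whitespace between them.
ADJACENT = re.compile(rf"{TAG}\s*{TAG}")


def lint_prompt(prompt):
    """Return a list of human-readable problems found in a tagged prompt."""
    problems = []
    # Rule 1: tags are enclosed in square brackets, so once well-formed
    # tags are removed no stray bracket should remain.
    leftover = re.sub(TAG, "", prompt)
    if "[" in leftover or "]" in leftover:
        problems.append("unbalanced or empty square brackets")
    # Rule 2: tags should not sit directly next to each other.
    for match in ADJACENT.finditer(prompt):
        problems.append(f"adjacent tags: {match.group(0)!r}")
    return problems


print(lint_prompt("[slow] Hello. [short pause] Done."))
print(lint_prompt("[whispers][slow] Hello"))
print(lint_prompt("[slow Hello"))
```

An empty list means the prompt passed both checks; anything else names the rule that was broken.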

Gemini 3.1 TTS offers granular control over several aspects of speech. You can manage pacing with tags like [slow] or [fast], and control timing using [short pause] or [long pause]. The model also supports specific vocalizations such as [cackles], [laughs], or [whispers], as well as broader stylistic tones like [encouraging], [scholarly], [mysterious], and [friendly].
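As a rough illustration, the sketch below extracts the tags from a prompt and sorts them into the four groups named in this article. The group names and the group_tags helper are this article's shorthand, not Google's terminology. Since tags are natural language commands, the lists are examples rather than a complete vocabulary, so anything unlisted falls into "other".

```python
import re

# Example tags named in Google's post, grouped as in this article.
# Assumption: these lists are illustrative, not an exhaustive vocabulary.
TAG_GROUPS = {
    "pacing": {"slow", "fast"},
    "pause": {"short pause", "long pause"},
    "vocalization": {"whispers", "cackles", "laughs", "screams"},
    "style": {"encouraging", "scholarly", "mysterious", "friendly"},
}


def group_tags(prompt):
    """List (tag, group) pairs for each tag, in order of appearance."""
    result = []
    for tag in re.findall(r"\[([^\[\]]+)\]", prompt):
        name = tag.strip().lower()
        group = next(
            (g for g, names in TAG_GROUPS.items() if name in names),
            "other",
        )
        result.append((name, group))
    return result


sample = "[encouraging] Again. [slow] Bonjour. [short pause] Perfect! [laughs] Nice. [awe] Wow."
for name, group in group_tags(sample):
    print(f"{name}: {group}")
```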

Gemini 3.1 TTS is available for developers and businesses building audio-first applications through the Gemini API. It is designed for use cases ranging from language learning tools and interactive podcast apps to adaptive customer service offerings. The model provides an intuitive way to guide vocal style and delivery, allowing for more natural and expressive speech generation in production environments.
