HeadsUpAI

OpenAI Releases Prompting Guide to Control Reasoning Effort in Voice Agents

OpenAI released a technical prompting guide for gpt-realtime-2, its reasoning voice model for low-latency speech-to-speech applications. The guide introduces a reasoning.effort parameter, which controls how much internal reasoning the model does before responding, with four levels from minimal to high. Developers can use it to trade response speed for deeper reasoning.
Reasoning effort levels: minimal, low, medium, high
Context window: 128K tokens
Response phases: commentary, final_answer
Preamble length: One to two sentences
Availability: OpenAI API

This update moves voice AI from simple conversational loops toward reasoning-capable agents that can plan multi-step actions. By formalizing preambles (short spoken updates that fill the silence while the model reasons), OpenAI addresses the perceived latency of reasoning voice agents. The guide builds on the gpt-realtime-2 launch to provide engineering patterns for reliable, high-precision audio interfaces.

Developers can now implement entity capture workflows that use digit-by-digit confirmation for high-precision data such as order IDs. The model also supports an expanded 128K-token context window (the amount of information a model can process at once), enough for sessions of roughly one to two hours. These capabilities are available via the OpenAI API.

OpenAI Developers (@OpenAIDevs) on X:

Building voice applications with GPT-Realtime-2? Our new prompting guide covers how to tune reasoning effort, use preambles, design tool behavior, handle unclear audio, capture exact entities, and maintain state in longer sessions. https://t.co/9zfdhIX4Vq

Still wondering? A few quick answers below.

gpt-realtime-2 is a reasoning voice model designed for low-latency speech-to-speech applications. Unlike standard voice models, it can perform internal reasoning before responding, allowing it to follow complex instructions and use tools with higher precision. It is built to handle multi-step tasks and high-precision data capture within a conversational voice interface.

Developers can tune the reasoning effort of gpt-realtime-2 with the reasoning.effort API parameter, which has four levels: minimal, low, medium, and high. The setting trades reasoning depth against response latency. Minimal effort gives the fastest responses for simple tasks, while high effort enables deeper reasoning for complex troubleshooting or multi-step workflows.
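
As a rough illustration of the trade-off described above, the sketch below maps task types to one of the four effort levels and wraps the choice in a session.update-style event. The task names, the default level, and the nested payload shape ({"session": {"reasoning": {"effort": ...}}}) are assumptions made for illustration; only the parameter name reasoning.effort and its four levels come from the text, so check the API reference for the actual schema.

```python
# The four levels named in the text above.
EFFORT_LEVELS = ("minimal", "low", "medium", "high")

# Hypothetical routing table: which effort level each kind of turn gets.
# The task names are invented for this example.
TASK_EFFORT = {
    "greeting": "minimal",         # simple task, fastest response
    "faq_lookup": "low",
    "order_change": "medium",
    "troubleshooting": "high",     # complex, multi-step reasoning
}


def pick_effort(task: str, default: str = "low") -> str:
    """Return the effort level for a task, falling back to a default."""
    return TASK_EFFORT.get(task, default)


def build_session_update(effort: str) -> dict:
    """Build an event that sets reasoning effort for the session.

    ASSUMPTION: the nesting below simply mirrors the dotted name
    reasoning.effort; it is not confirmed by the source.
    """
    if effort not in EFFORT_LEVELS:
        raise ValueError(f"effort must be one of {EFFORT_LEVELS}, got {effort!r}")
    return {
        "type": "session.update",
        "session": {"reasoning": {"effort": effort}},
    }


event = build_session_update(pick_effort("troubleshooting"))
```

Keeping the routing table separate from the payload builder means the latency budget per task can be adjusted without touching the code that talks to the API.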

Preambles are short spoken updates, such as "I'll check that for you," that a voice agent says before performing a longer reasoning process or tool call. They are designed to keep the conversation feeling responsive and reassure the user that work is happening, preventing awkward silences that might occur while the model is thinking or accessing data.
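
One way to keep preambles within the one-to-two-sentence length listed in the summary is to check candidate phrases before adding them to a prompt. The sketch below is a hypothetical helper, not part of any OpenAI tooling, and its sentence splitter is deliberately naive (it would miscount abbreviations such as "Dr.").

```python
import re

# "One to two sentences" per the summary above.
MIN_PREAMBLE_SENTENCES = 1
MAX_PREAMBLE_SENTENCES = 2


def count_sentences(text: str) -> int:
    """Count sentences by splitting on ., ! or ? followed by space or end."""
    parts = re.split(r"[.!?]+(?:\s+|$)", text.strip())
    return len([p for p in parts if p])


def is_valid_preamble(text: str) -> bool:
    """True if the text is a one- or two-sentence spoken update."""
    n = count_sentences(text)
    return MIN_PREAMBLE_SENTENCES <= n <= MAX_PREAMBLE_SENTENCES


# Example candidates; the second and third are invented for illustration.
candidates = [
    "I'll check that for you.",
    "One moment. I'm pulling up your order now.",
    "Sure. Let me look. This might take a while. Please hold.",
]
accepted = [c for c in candidates if is_valid_preamble(c)]
```

Here the first two candidates pass and the four-sentence one is rejected, since a long preamble defeats the purpose of a quick spoken acknowledgement.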

To capture high-precision data like order IDs or email addresses, gpt-realtime-2 uses a conservative entity capture workflow. This involves collecting one value at a time, normalizing the input, and reading it back to the user for confirmation. For numeric identifiers, the model is instructed to read values back digit by digit to ensure accuracy before proceeding with tool calls.
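
The normalize-then-read-back steps above can be sketched in application code as follows. In the guide the model itself is instructed to do this through the prompt; the functions here are hypothetical and only illustrate the logic, assuming a numeric order ID and an English transcript.

```python
import re

# Spoken forms mapped to digits. "oh" is treated as zero, which would
# misfire on the interjection "Oh, ..."; a real system needs more care.
WORD_TO_DIGIT = {
    "zero": "0", "oh": "0", "one": "1", "two": "2", "three": "3",
    "four": "4", "five": "5", "six": "6", "seven": "7", "eight": "8",
    "nine": "9",
}


def normalize_order_id(heard: str) -> str:
    """Reduce a transcribed utterance to a plain digit string."""
    tokens = re.findall(r"[a-z]+|\d", heard.lower())
    digits = []
    for tok in tokens:
        if tok.isdigit():
            digits.append(tok)
        elif tok in WORD_TO_DIGIT:
            digits.append(WORD_TO_DIGIT[tok])
        # Any other word is filler and is ignored.
    return "".join(digits)


def read_back_digits(order_id: str) -> str:
    """Space the digits out so they are spoken one at a time."""
    return " ".join(order_id)


def confirmation_prompt(heard: str) -> str:
    """Normalize one value and phrase the read-back question."""
    order_id = normalize_order_id(heard)
    return f"I have order number {read_back_digits(order_id)}. Is that correct?"


prompt = confirmation_prompt("four two, oh 7")
# The tool call should only go ahead after the user confirms.
```

Collecting and confirming one value at a time keeps a misheard digit from propagating into a tool call that acts on the wrong order.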

gpt-realtime-2 has a context window of 128,000 tokens, up from the 32,000 tokens available in earlier realtime models. The larger window lets the model maintain state over long sessions, supporting approximately one to two hours of dense audio conversation without losing track of the dialogue history.
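
The figures above imply a rough token budget per minute of conversation. The arithmetic below is only a back-of-the-envelope reading of the numbers in the text, not an official rate, and it ignores that instructions and tool output also occupy the window.

```python
CONTEXT_TOKENS = 128_000          # gpt-realtime-2, per the text above
EARLIER_CONTEXT_TOKENS = 32_000   # earlier realtime models, per the text above


def tokens_per_minute(context_tokens: int, session_minutes: int) -> float:
    """Average tokens consumed per minute if the session fills the window."""
    return context_tokens / session_minutes


# A window that lasts one to two hours implies this range of usage rates.
fast_rate = tokens_per_minute(CONTEXT_TOKENS, 60)    # window full after 1 hour
slow_rate = tokens_per_minute(CONTEXT_TOKENS, 120)   # window full after 2 hours

# Size of the increase over the earlier window.
growth_factor = CONTEXT_TOKENS / EARLIER_CONTEXT_TOKENS

# At the same rates, the earlier window would fill in this many minutes.
earlier_minutes_fast = EARLIER_CONTEXT_TOKENS / fast_rate
earlier_minutes_slow = EARLIER_CONTEXT_TOKENS / slow_rate

print(f"{slow_rate:.0f} to {fast_rate:.0f} tokens per minute")
print(f"{growth_factor:.0f}x larger window")
print(f"earlier window: {earlier_minutes_fast:.0f} to {earlier_minutes_slow:.0f} minutes")
```

Under these assumptions, the same conversation density that fills 128,000 tokens in one to two hours would fill 32,000 tokens in about 15 to 30 minutes, which is why the larger window matters for long sessions.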
