⚙️ We made agent loops faster with WebSockets in the Responses API As Codex got faster, the bottleneck moved from inference to inefficient API calls WebSockets keep response state warm across tool calls, helping workflows run up to 40% faster end to end https://t.co/nFeUEdRdKt
OpenAI Speeds Up Agentic Loops With Persistent WebSocket Connections
OpenAI launched WebSocket support for the Responses API, introducing a persistent connection mode for high-frequency agentic loops (the iterative cycle of AI planning and action). Instead of establishing new HTTP connections for every tool call, the API now keeps a "warm" state that caches tool definitions and conversation history in memory.
- End-to-end speedup
- Up to 40%
- Inference speed
- 1,000 tokens per second
- Peak burst speed
- 4,000 tokens per second
- Protocol
- WebSockets
- Availability
- Responses API
- Optimized model
- GPT-5.3-Codex-Spark
As model inference (the process of generating outputs) speeds have surged past 1,000 tokens per second, the API's structural overhead became the primary bottleneck. This shift follows the OpenAI performance roadmap and extends support for multi-day autonomous Codex workflows that require high-speed, persistent execution for complex engineering tasks.
You can implement WebSocket mode by passing a previous_response_id to continue a conversation without re-sending the full history, matching the pattern seen in Vercel's AI SDK. The feature is available now for developers using the Responses API, specifically optimized for GPT-5.3-Codex-Spark.
OpenAI Developers
@OpenAIDevs
38retweets586likes
View on XStill wondering? A few quick answers below.
It is a persistent connection transport for the Responses API designed to reduce latency in agentic workflows. By keeping a connection open instead of making repeated HTTP calls, the server can cache conversation state, tool definitions, and rendered tokens in memory, allowing the system to process multi-step tasks significantly faster.
OpenAI reports that agentic workflows run up to 40% faster end to end when using WebSocket mode. This improvement is achieved by eliminating redundant API work, such as re-tokenizing the full conversation history and re-validating requests, which previously added significant overhead as model inference speeds increased toward 1,000 tokens per second.
Developers can use the familiar response.create structure but include a previous_response_id to continue a conversation from a cached state. This allows the server to fetch the previous response object and prior input items from a connection-scoped cache rather than rebuilding the entire context from scratch for every follow-up request in the loop.
The update was specifically designed to support high-speed models like GPT-5.3-Codex-Spark, which runs on specialized hardware at over 1,000 tokens per second. However, other models including GPT-5.4 and subsequent releases also benefit from the reduced API overhead, making multi-file coding tasks and complex tool-use workflows feel much more responsive.
Following a successful alpha period with coding agent startups like Cursor, Vercel, and Cline, WebSocket mode is now available for general use within the Responses API. It is intended for developers building autonomous systems that require frequent back-and-forth communication between the model and local tools or computer environments.







