OpenAI Launches WebSocket Mode for Responses API, Cutting Agent Latency by 40%
OpenAI has introduced WebSocket mode for its Responses API, enabling persistent connections that reduce end-to-end latency in agentic workflows by up to 40%. The feature was developed to keep pace with faster inference models like GPT-5.3-Codex-Spark, which runs at over 1,000 tokens per second on Cerebras hardware. Previously, each agent step required a new HTTP request, causing cumulative overhead. WebSocket mode caches conversation state in memory, allowing follow-up requests to skip redundant processing. Alpha users including Vercel, Cline, and Cursor reported latency improvements of 30–40%. The mode supports existing API shapes via previous_response_id, minimizing developer disruption. The launch follows a two-month sprint by OpenAI's API and Codex teams, with Codex now routing the majority of its traffic through WebSockets.
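The chaining mechanism described above can be sketched as follows. This is a hedged illustration, not OpenAI's actual wire protocol: `previous_response_id` comes from the article, but the other field names, the model string, and the response id are illustrative assumptions.

```python
import json

# Hedged sketch: message shapes modeled on previous_response_id chaining.
# Field names other than previous_response_id are illustrative assumptions.

def first_turn(model: str, user_input: str) -> str:
    """Build the opening request of an agentic session."""
    return json.dumps({"model": model, "input": user_input})

def follow_up(model: str, user_input: str, previous_response_id: str) -> str:
    """Build a follow-up request that references the prior response.

    Referencing the prior response is what lets the server reuse cached
    conversation state instead of rebuilding the full history per step.
    """
    return json.dumps({
        "model": model,
        "input": user_input,
        "previous_response_id": previous_response_id,
    })

# Over a persistent WebSocket, each agent step sends one such frame;
# "resp_abc123" is a placeholder id, not a real response identifier.
msg1 = first_turn("gpt-5.3-codex-spark", "List the repo's open TODOs.")
msg2 = follow_up("gpt-5.3-codex-spark", "Fix the first one.", "resp_abc123")
```

Because the follow-up frame carries only the new input plus a pointer to cached state, the payload stays small regardless of how long the conversation has run.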
Key facts
- WebSocket mode reduces agentic workflow latency by up to 40%.
- GPT-5.3-Codex-Spark achieves over 1,000 tokens per second, with bursts up to 4,000 TPS.
- The feature caches previous response state in memory to avoid rebuilding full conversation history.
- Alpha users include Vercel (40% latency decrease), Cline (39% faster), and Cursor (30% faster).
- WebSocket mode holds a single persistent connection open, replacing a fresh synchronous HTTP call for every agent step.
- The Responses API was launched in March 2025.
- Optimizations include caching rendered tokens, reducing network hops, and improving safety classifiers.
- The feature was developed by OpenAI's API and Codex teams over two months.
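The reported savings follow from simple arithmetic on per-step overhead. The model below is illustrative only: the overhead and processing figures are assumptions chosen to show how cumulative per-request cost compares with a one-time handshake, landing in the 30–40% range the alpha users reported.

```python
# Illustrative latency model; all millisecond figures are assumptions,
# not measurements from OpenAI or its alpha users.

def http_mode_latency(steps: int, overhead_ms: float, processing_ms: float) -> float:
    """Each agent step pays connection/request overhead anew."""
    return steps * (overhead_ms + processing_ms)

def websocket_mode_latency(steps: int, handshake_ms: float, processing_ms: float) -> float:
    """One handshake up front; later steps reuse the persistent connection."""
    return handshake_ms + steps * processing_ms

http_total = http_mode_latency(10, 200.0, 300.0)      # 10 * 500 ms = 5000 ms
ws_total = websocket_mode_latency(10, 200.0, 300.0)   # 200 + 3000 ms = 3200 ms
reduction = 1 - ws_total / http_total                 # 0.36, i.e. ~36% faster
```

The longer the agentic workflow (more steps), the closer the reduction gets to the overhead's share of per-step cost, which is why agent-heavy users like Codex see the largest gains.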
Entities
Institutions
- OpenAI
- Cerebras
- Vercel
- Cline
- Cursor