OpenAI Launches WebSocket Mode for Responses API, Cutting Agent Latency by 40%
OpenAI has introduced WebSocket mode for its Responses API, enabling persistent connections that reduce end-to-end latency in agentic workflows by up to 40%. The feature was developed to keep pace with faster inference models like GPT-5.3-Codex-Spark, which runs at over 1,000 tokens per second on Cerebras hardware. Previously, each agent step required a new HTTP request, causing cumulative overhead. WebSocket mode caches conversation state in memory, allowing follow-up requests to skip redundant processing. Alpha users including Vercel, Cline, and Cursor reported latency improvements of 30–40%. The mode supports existing API shapes via previous_response_id, minimizing developer disruption. The launch follows a two-month sprint by OpenAI's API and Codex teams, with Codex now routing the majority of its traffic through WebSockets.
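The chaining mechanism described above can be sketched as follows. This is a hedged illustration, not OpenAI's actual wire protocol: `previous_response_id` comes from the article, but the other field names, the model string, and the response id are illustrative assumptions.

```python
import json

# Hedged sketch: message shapes modeled on previous_response_id chaining.
# Field names other than previous_response_id are illustrative assumptions.

def first_turn(model: str, user_input: str) -> str:
    """Build the opening request of an agentic session."""
    return json.dumps({"model": model, "input": user_input})

def follow_up(model: str, user_input: str, previous_response_id: str) -> str:
    """Build a follow-up request that references the prior response.

    Referencing the prior response is what lets the server reuse cached
    conversation state instead of rebuilding the full history per step.
    """
    return json.dumps({
        "model": model,
        "input": user_input,
        "previous_response_id": previous_response_id,
    })

# Over a persistent WebSocket, each agent step sends one such frame;
# "resp_abc123" is a placeholder id, not a real response identifier.
msg1 = first_turn("gpt-5.3-codex-spark", "List the repo's open TODOs.")
msg2 = follow_up("gpt-5.3-codex-spark", "Fix the first one.", "resp_abc123")
```

Because the follow-up frame carries only the new input plus a pointer to cached state, the payload stays small regardless of how long the conversation has run.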
Key facts
- WebSocket mode reduces agentic workflow latency by up to 40%.
- GPT-5.3-Codex-Spark achieves over 1,000 tokens per second, with bursts up to 4,000 TPS.
- The feature caches previous response state in memory to avoid rebuilding full conversation history.
- Alpha users include Vercel (40% latency decrease), Cline (39% faster), and Cursor (30% faster).
- WebSocket mode holds a single persistent connection open, replacing a fresh synchronous HTTP call for every agent step.
- The Responses API was launched in March 2025.
- Optimizations include caching rendered tokens, reducing network hops, and improving safety classifiers.
- The feature was developed by OpenAI's API and Codex teams over two months.
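The reported savings follow from simple arithmetic on per-step overhead. The model below is illustrative only: the overhead and processing figures are assumptions chosen to show how cumulative per-request cost compares with a one-time handshake, landing in the 30–40% range the alpha users reported.

```python
# Illustrative latency model; all millisecond figures are assumptions,
# not measurements from OpenAI or its alpha users.

def http_mode_latency(steps: int, overhead_ms: float, processing_ms: float) -> float:
    """Each agent step pays connection/request overhead anew."""
    return steps * (overhead_ms + processing_ms)

def websocket_mode_latency(steps: int, handshake_ms: float, processing_ms: float) -> float:
    """One handshake up front; later steps reuse the persistent connection."""
    return handshake_ms + steps * processing_ms

http_total = http_mode_latency(10, 200.0, 300.0)      # 10 * 500 ms = 5000 ms
ws_total = websocket_mode_latency(10, 200.0, 300.0)   # 200 + 3000 ms = 3200 ms
reduction = 1 - ws_total / http_total                 # 0.36, i.e. ~36% faster
```

The longer the agentic workflow (more steps), the closer the reduction gets to the overhead's share of per-step cost, which is why agent-heavy users like Codex see the largest gains.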
Entities
Institutions
- OpenAI
- Cerebras
- Vercel
- Cline
- Cursor