ARTFEED — Contemporary Art Intelligence

OpenAI Launches WebSocket Mode for Responses API, Cutting Agent Latency by 40%

ai-technology · 2026-04-24

OpenAI has introduced WebSocket mode for its Responses API, enabling persistent connections that reduce end-to-end latency in agentic workflows by up to 40%. The feature was developed to keep pace with faster inference models like GPT-5.3-Codex-Spark, which runs at over 1,000 tokens per second on Cerebras hardware. Previously, each agent step required a new HTTP request, and the connection and processing overhead accumulated across steps. WebSocket mode caches conversation state in memory, allowing follow-up requests to skip rebuilding the full conversation history. Alpha users including Vercel, Cline, and Cursor reported latency improvements of 30–40%. The mode supports existing API shapes via previous_response_id, minimizing developer disruption. The launch follows a two-month sprint by OpenAI's API and Codex teams, with Codex now routing the majority of its traffic through WebSockets.

Key facts

  • WebSocket mode reduces agentic workflow latency by up to 40%.
  • GPT-5.3-Codex-Spark achieves over 1,000 tokens per second, with bursts up to 4,000 TPS.
  • The feature caches previous response state in memory to avoid rebuilding full conversation history.
  • Alpha users include Vercel (40% latency decrease), Cline (39% faster), and Cursor (30% faster).
  • WebSocket mode uses a single persistent connection instead of a new HTTP request per step.
  • The Responses API was launched in March 2025.
  • Optimizations include caching rendered tokens, reducing network hops, and improving safety classifiers.
  • The feature was developed by OpenAI's API and Codex teams over two months.
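A rough back-of-envelope model shows why per-step overhead dominates as inference gets faster: over HTTP, every agent step pays connection and history-rebuild costs, while a WebSocket pays a one-time handshake. The step counts and millisecond figures below are illustrative assumptions chosen to land near the reported 30–40% range, not measured values.

```python
def total_latency_ms(steps: int, inference_ms: float,
                     per_step_overhead_ms: float,
                     one_time_overhead_ms: float = 0.0) -> float:
    """Total wall-clock time for an agentic workflow.

    HTTP mode pays per_step_overhead_ms on every step (new request,
    conversation history replayed); WebSocket mode pays only a
    one-time setup cost on the persistent connection.
    """
    return one_time_overhead_ms + steps * (inference_ms + per_step_overhead_ms)


# Illustrative numbers: 20 agent steps, 150 ms of inference per step,
# 100 ms of per-request overhead vs. a single 100 ms handshake.
http_ms = total_latency_ms(20, 150, per_step_overhead_ms=100)
ws_ms = total_latency_ms(20, 150, per_step_overhead_ms=0,
                         one_time_overhead_ms=100)
saving = 1 - ws_ms / http_ms  # fraction of latency removed
```

With these assumed numbers the HTTP run takes 5,000 ms, the WebSocket run 3,100 ms, a 38% reduction; the faster the model's inference per step, the larger the share of total time the fixed per-request overhead represents, which is why the feature was paired with GPT-5.3-Codex-Spark.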

Entities

Institutions

  • OpenAI
  • Cerebras
  • Vercel
  • Cline
  • Cursor

Sources