ARTFEED — Contemporary Art Intelligence

DeepSeek-V4: Million-Token Context Model Optimized for Agent Workloads

ai-technology · 2026-04-24

DeepSeek has unveiled V4, a suite of open-weight AI models built for long-running agentic tasks, targeting the KV cache and context-budget bottlenecks those workloads hit. The architecture combines two forms of hybrid attention: Compressed Sparse Attention (CSA), which cuts KV entries by 4x, and Heavily Compressed Attention (HCA), which compresses them 128x. Relative to V3.2, V4-Pro needs only 27% of the single-token inference FLOPs and 10% of the KV cache memory, while V4-Flash runs at 10% FLOPs and 7% KV cache. On benchmarks, V4-Pro-Max scored 67.9 on Terminal Bench 2.0 and 80.6 on SWE Verified. In a survey, 52% of 85 DeepSeek developers said V4-Pro is ready to replace their primary coding model. Four checkpoints are available on Hugging Face.
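To put the reported compression ratios in perspective, here is a back-of-envelope sketch of KV cache sizing at a million-token context. The layer count, head dimension, and KV-head count below are assumptions chosen for illustration, not figures from the announcement; only the 4x, 128x, and ~2% ratios come from the article.

```python
# Back-of-envelope KV cache sizing for a 1M-token context.
# Assumed shapes (NOT from the article): 61 layers, head_dim 128,
# 8 KV heads, bfloat16 (2 bytes/element). The 4x / 128x / ~2%
# ratios are the figures the article reports.
BYTES_BF16 = 2
layers, kv_heads, head_dim = 61, 8, 128   # hypothetical GQA config
context = 1_000_000

# Baseline GQA: K and V vectors cached per KV head, per layer, per token.
gqa_bytes = 2 * kv_heads * head_dim * BYTES_BF16 * layers * context

for name, ratio in [("GQA baseline", 1.0),
                    ("CSA (4x)", 1 / 4),
                    ("HCA (128x)", 1 / 128),
                    ("V4 overall (~2%)", 0.02)]:
    print(f"{name:18s} {gqa_bytes * ratio / 2**30:8.1f} GiB")
```

Under these assumed shapes, the baseline cache lands in the low hundreds of GiB, which is why a ~2% ratio matters: it brings a million-token cache down to single-digit GiB on the same hypothetical config.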

Key facts

  • DeepSeek-V4 introduces hybrid attention with CSA (4x compression) and HCA (128x compression) to reduce KV cache and FLOPs.
  • V4-Pro requires 27% of single-token inference FLOPs and 10% of KV cache compared to V3.2; V4-Flash drops to 10% FLOPs and 7% KV cache.
  • V4 uses roughly 2% of the KV cache memory of a grouped-query attention (GQA) baseline with 8 KV heads in bfloat16.
  • Interleaved thinking preserves reasoning across user turns when tool calls are present.
  • New tool-call schema uses |DSML| token and XML format to reduce escaping errors.
  • DSec sandbox platform enables RL training with fast image loading and preemption-safe replay.
  • V4-Pro-Max scores 80.6 on SWE Verified, within a point of Opus-4.6-Max and Gemini-3.1-Pro.
  • Four checkpoints released: V4-Pro (1.6T/49B activated) and V4-Flash (284B/13B activated), each with instruct and base versions.
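The escaping point in the tool-call item above can be illustrated with a toy comparison. The tag names and schema below are invented for this sketch; the article only says the format uses a |DSML| token plus XML, without specifying the schema.

```python
# Hypothetical illustration of why an XML-style tool-call format can
# avoid the escaping errors that JSON string arguments invite.
# The <invoke>/<code> element names are invented for this sketch.
import json
import xml.etree.ElementTree as ET

code = 'print("hello\\nworld")'   # argument containing quotes and a backslash

# JSON tool call: the code must be escaped into a string literal,
# doubling backslashes and escaping every inner quote.
json_call = json.dumps({"tool": "run_python", "args": {"code": code}})

# XML-style call: the code sits in an element body verbatim; only
# <, > and & would need entity escaping, not quotes or backslashes.
xml_call = f'<invoke tool="run_python"><code>{code}</code></invoke>'
parsed = ET.fromstring(xml_call)
assert parsed.find("code").text == code
```

A model emitting the XML form can copy code arguments through unchanged in most cases, which plausibly reduces the malformed-escape failures the new schema is said to address.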

Entities

Institutions

  • DeepSeek
  • Hugging Face

Sources