Frontier AI Models Show Capability Cooperation but Saturation Looms

ai-technology · 2026-05-20

A new arXiv preprint (2605.18840) analyzes 34 frontier AI models from 10 labs released between 2024 and 2026 and finds that capabilities cooperate across benchmarks (r = +0.72, p < 10⁻⁶), although this cooperation varies by lab and over time. DeepSeek reversed from reasoning-rich to coding-first (h: +11.2 → -4.7, a 15.9 pp swing), Google maintains a consistent reasoning emphasis, and Anthropic oscillates between coding excursions and recovery. Six open-weight architectures confirm a second capability transition at 30–72B parameters. SWE-bench is now saturating, while HLE (Humanity's Last Exam) emerges as the more informative next metric. The paper introduces two diagnostics, a population coupling trend and a per-release residual (the h-field), to characterize each release's capability emphasis and to identify which measurement is most informative next.

Key facts

34 models from 10 labs analyzed over 2024–2026
Capabilities cooperate across benchmarks (r = +0.72, p < 10⁻⁶)
DeepSeek reversed from reasoning-rich to coding-first (h: +11.2 → -4.7, 15.9 pp swing)
Google maintains consistent reasoning emphasis
Anthropic oscillates between coding excursions and recovery
Six open-weight architectures show second capability transition at 30–72B
SWE-bench is saturating; HLE is next informative metric
Method uses population coupling trend and per-release residual (h-field)
