Frontier AI Models Show Capability Cooperation but Saturation Looms
A new arXiv preprint (2605.18840) analyzes 34 frontier AI models from 10 labs released between 2024 and 2026. It finds that capabilities cooperate across benchmarks: scores rise together rather than trading off (r = +0.72, p < 10⁻⁶). The strength of that cooperation varies by lab and over time. DeepSeek reversed from reasoning-rich to coding-first (h: +11.2 → -4.7, a 15.9 pp swing), Google maintains a consistent reasoning emphasis, and Anthropic oscillates between coding excursions and recovery. Six open-weight architectures confirm a second capability transition at 30–72B parameters. SWE-bench is now saturating, while HLE (Humanity's Last Exam) emerges as the more informative next metric. The paper introduces two tools, a population coupling trend and a per-release residual (the h-field), to diagnose each release's capability emphasis and to identify which measurement would be most informative next.
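The summary does not give the paper's exact definitions, so the following is only a minimal sketch of one plausible reading: the population coupling trend as an ordinary least-squares line relating a coding score to a reasoning score across all models, and the h-field as each release's residual from that line. The sign convention (positive h means more reasoning than the trend predicts), the function names, and the toy scores are all assumptions made for illustration, not values or code from the paper.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def fit_trend(xs, ys):
    """Ordinary least-squares line y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def h_field(coding, reasoning):
    """Per-release residual from the population trend (assumed definition).

    Positive values mean a release scores higher on reasoning than the
    trend predicts from its coding score; negative values mean the opposite.
    """
    intercept, slope = fit_trend(coding, reasoning)
    return [r - (intercept + slope * c) for c, r in zip(coding, reasoning)]


# Invented toy scores for five hypothetical releases (NOT data from the paper).
coding = [40.0, 55.0, 60.0, 70.0, 80.0]
reasoning = [20.0, 30.0, 28.0, 45.0, 50.0]

r = pearson_r(coding, reasoning)   # population-level coupling
h = h_field(coding, reasoning)     # per-release emphasis

# The swing reported for DeepSeek is the difference of its two h values.
swing = 11.2 - (-4.7)

print(f"r = {r:+.2f}")
print("h =", [round(v, 1) for v in h])
print(f"swing = {swing:.1f} pp")   # swing = 15.9 pp
```

Under this reading, a lab's trajectory is the sequence of h values for its successive releases, so a sign change such as +11.2 to -4.7 marks a shift from reasoning emphasis to coding emphasis relative to the rest of the population.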
Key facts
- 34 models from 10 labs analyzed over 2024–2026
- Capabilities cooperate across benchmarks (r = +0.72, p < 10⁻⁶)
- DeepSeek reversed from reasoning-rich to coding-first (h: +11.2 → -4.7, 15.9 pp swing)
- Google maintains consistent reasoning emphasis
- Anthropic oscillates between coding excursions and recovery
- Six open-weight architectures show a second capability transition at 30–72B parameters
- SWE-bench is saturating; HLE (Humanity's Last Exam) is the more informative next metric
- Method uses a population coupling trend and a per-release residual (h-field)
Entities
Institutions
- DeepSeek
- Anthropic
- arXiv