ChaosBench-Logic v2: LLM Logical Reasoning Benchmark Over Dynamical Systems

ai-technology · 2026-05-26

ChaosBench-Logic v2 has been introduced as a benchmark for assessing logical reasoning by large language models (LLMs) over dynamical systems. It contains 40,886 questions spanning 165 systems and is built on 27 first-order logic (FOL) predicates and 78 axiom edges. Its CARE protocol surfaces pathologies such as prior collapse and inconsistency under paraphrase. In an evaluation of 14 models, regime-transition reasoning is near random (Matthews correlation coefficient, MCC=0.05), while FOL deduction with given premises reaches an MCC of 0.52. Proprietary models show an advantage on cross-indicator (+0.40) and consistency tasks, whereas the open-source Qwen 2.5-32B leads on indicator diagnostics (0.91 vs. 0.45). Notably, two models have negative MCC on bifurcation questions, suggesting systematic anti-correlation with the correct answers.

Key facts

ChaosBench-Logic v2 includes 40,886 questions over 165 dynamical systems.
The benchmark uses 27 FOL predicates and 78 axiom edges.
The CARE protocol surfaces pathologies like prior collapse and inconsistency under paraphrase.
14 models were evaluated.
Regime-transition reasoning was near random (MCC=0.05).
FOL deduction with given premises reached MCC=0.52.
Proprietary models showed an advantage on cross-indicator (+0.40) and consistency tasks.
Qwen 2.5-32B dominated indicator diagnostics (0.91 vs. 0.45).
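
To make the MCC figures above easier to read, here is a minimal sketch of the standard Matthews correlation coefficient for binary (yes/no) answers. The function name, the confusion-matrix counts, and the reading of "prior collapse" as a model that always gives the same answer are illustrative assumptions; none of them come from the benchmark's own code or data.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from binary confusion-matrix counts.

    Returns 0.0 when the denominator is zero (for example, when every
    prediction is the same label), which is a common convention.
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0


# Hypothetical counts on a balanced set of 100 yes/no questions.
perfect = mcc(tp=50, tn=50, fp=0, fn=0)        # every answer correct
coin_flip = mcc(tp=25, tn=25, fp=25, fn=25)    # no better than chance
always_yes = mcc(tp=50, tn=0, fp=50, fn=0)     # assumed form of prior collapse
inverted = mcc(tp=0, tn=0, fp=50, fn=50)       # every answer wrong

print(perfect, coin_flip, always_yes, inverted)
```

In this sketch, the always-yes model still scores 50% accuracy but an MCC of 0, which is why MCC is a more informative metric than accuracy when a model collapses onto one label. A negative value, as reported for two models on bifurcation questions, means the predictions are worse than chance in a consistent direction.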
