ChaosBench-Logic v2: LLM Logical Reasoning Benchmark Over Dynamical Systems
ChaosBench-Logic v2 is a new benchmark for assessing logical reasoning by large language models (LLMs) over dynamical systems. It contains 40,886 questions spanning 165 systems, built on 27 first-order logic (FOL) predicates and 78 axiom edges. Its CARE protocol surfaces pathologies such as prior collapse and inconsistency under paraphrase. An evaluation of 14 models, scored with the Matthews correlation coefficient (MCC), finds that regime-transition reasoning is near random (MCC=0.05), while FOL deduction with given premises reaches MCC=0.52. Proprietary models hold an advantage on cross-indicator (+0.40) and consistency tasks, whereas the open-source Qwen 2.5-32B dominates indicator diagnostics (0.91 vs. 0.45). Notably, two models score a negative MCC on bifurcation questions, meaning their answers are systematically anti-correlated with the correct labels.
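Because every headline number here is an MCC, it helps to see how the metric behaves at its reference points. The following is a minimal sketch using the standard binary MCC formula. The confusion-matrix counts are invented for illustration and are not benchmark data, and returning 0.0 when the denominator is zero is a common convention assumed here, not something the benchmark is known to specify.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient for a binary confusion matrix."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: an undefined MCC (zero denominator) is reported as 0.0.
    return numerator / denominator if denominator else 0.0


# Illustrative counts only (100 yes/no questions, 50 true and 50 false).
perfect = mcc(tp=50, tn=50, fp=0, fn=0)        # every answer correct
coin_flip = mcc(tp=25, tn=25, fp=25, fn=25)    # chance-level answers
always_yes = mcc(tp=50, tn=0, fp=50, fn=0)     # one answer regardless of input
inverted = mcc(tp=0, tn=0, fp=50, fn=50)       # every answer wrong

print(perfect, coin_flip, always_yes, inverted)
```

Under these assumptions, a model that gives the same answer to every question scores 0 even though its raw accuracy is 50%, which is why MCC is a reasonable choice for exposing a collapse onto a single answer. A score below 0 means the predictions carry signal, but with the wrong sign.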
Key facts
- ChaosBench-Logic v2 includes 40,886 questions over 165 dynamical systems.
- The benchmark uses 27 FOL predicates and 78 axiom edges.
- The CARE protocol surfaces pathologies such as prior collapse and inconsistency under paraphrase.
- 14 models were evaluated.
- Regime-transition reasoning was near random (MCC=0.05).
- FOL deduction with given premises reached MCC=0.52.
- Proprietary models showed an advantage on cross-indicator (+0.40) and consistency tasks.
- Qwen 2.5-32B dominated indicator diagnostics (0.91 vs. 0.45).
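One way to picture "predicates and axiom edges" is as a directed implication graph, where FOL deduction with given premises amounts to following edges from the stated facts. The sketch below rests on that assumption. The predicate names and edges are hypothetical placeholders, not the benchmark's actual 27 predicates or 78 axioms, and the real axioms may include forms (such as exclusions) that this simple graph does not model.

```python
from collections import deque

# Hypothetical implication edges: premise predicate -> predicates it entails.
AXIOM_EDGES = {
    "Chaotic": ["Deterministic", "SensitiveToInitialConditions"],
    "SensitiveToInitialConditions": ["PositiveLyapunovExponent"],
}


def deductive_closure(facts, edges):
    """Return every predicate reachable from the given facts via implication edges."""
    derived = set(facts)
    queue = deque(facts)
    while queue:
        predicate = queue.popleft()
        for consequence in edges.get(predicate, ()):
            if consequence not in derived:
                derived.add(consequence)
                queue.append(consequence)
    return derived


from_chaotic = deductive_closure({"Chaotic"}, AXIOM_EDGES)
from_deterministic = deductive_closure({"Deterministic"}, AXIOM_EDGES)
print(sorted(from_chaotic))
print(sorted(from_deterministic))
```

In this toy graph, asserting "Chaotic" yields three further predicates, while asserting "Deterministic" yields nothing new, because implication edges are one-directional.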
Entities
Institutions
- arXiv