ARTFEED — Contemporary Art Intelligence

ChaosBench-Logic v2: LLM Logical Reasoning Benchmark Over Dynamical Systems

ai-technology · 2026-05-26

ChaosBench-Logic v2 has been introduced as a new benchmark for assessing logical reasoning in large language models (LLMs) over dynamical systems. It contains 40,886 questions spanning 165 systems, built on 27 first-order logic (FOL) predicates and 78 axiom edges. Its CARE protocol surfaces pathologies such as prior collapse and inconsistency under paraphrase. An evaluation of 14 models, scored by Matthews correlation coefficient (MCC), finds that regime-transition reasoning is nearly random (MCC=0.05), while FOL deduction with given premises reaches MCC=0.52. Proprietary models hold an advantage on cross-indicator (+0.40) and consistency tasks, whereas the open-source Qwen 2.5-32B leads on indicator diagnostics (0.91 vs. 0.45). Notably, two models score a negative MCC on bifurcation questions, suggesting their answers are systematically anti-correlated with the correct ones.

Key facts

  • ChaosBench-Logic v2 includes 40,886 questions over 165 dynamical systems.
  • The benchmark uses 27 FOL predicates and 78 axiom edges.
  • The CARE protocol surfaces pathologies such as prior collapse and inconsistency under paraphrase.
  • 14 models were evaluated.
  • Regime-transition reasoning was near random (MCC=0.05).
  • FOL deduction with given premises reached MCC=0.52.
  • Proprietary models showed an advantage on cross-indicator (+0.40) and consistency tasks.
  • Qwen 2.5-32B dominated indicator diagnostics (0.91 vs. 0.45).
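
To help read the MCC figures above, here is a minimal Python sketch of the standard Matthews correlation coefficient for binary answers. It is not the benchmark's own scoring code: the yes/no answer format, the function names, and the toy data are assumptions made for illustration. It shows why a score near 0 means chance-level answers and why a negative score means answers that tend to be the opposite of the truth.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (True = 'yes'). Hypothetical helper."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    return tp, tn, fp, fn


def mcc(y_true, y_pred):
    """Matthews correlation coefficient, ranging from -1 to +1."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Common convention: a degenerate confusion matrix scores 0.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Toy ground truth, invented for this example only.
truth = [True, True, False, False]

perfect = mcc(truth, truth)                      # every answer correct
inverted = mcc(truth, [not t for t in truth])    # every answer flipped
always_yes = mcc(truth, [True] * len(truth))     # same answer every time

print(perfect, inverted, always_yes)
```

On this toy data, the all-correct answers score 1.0, the flipped answers score -1.0, and answering "yes" every time scores 0.0 even though half of those answers are right. That is why MCC is more informative than raw accuracy when a model leans toward a single answer.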

Entities

Institutions

  • arXiv
