ARTFEED — Contemporary Art Intelligence

NeuroState-Bench: Benchmarking Commitment Integrity in LLM Agents

ai-technology · 2026-05-06

NeuroState-Bench is a human-calibrated benchmark for evaluating commitment integrity in LLM agent profiles. Rather than inferring hidden activations, it uses benchmark-defined side-query probes to test whether an agent preserves its commitments across multi-turn tasks. The suite comprises 144 deterministic tasks and 306 side-query probes spanning eight cognitively motivated failure families, each with clean and distractor variants across three difficulty bands. The main evaluation covers 32 profiles: 16 local and 16 hosted large-model profiles. Human calibration on 104 sampled task units (216 raw annotations, 108 adjudicated task rows) achieved weighted kappa = 0.977 and ICC(2,1) = 0.977. A central finding is that task success and commitment integrity are distinct dimensions of agent performance.
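The split between task success and commitment integrity can be sketched as two separately aggregated scores. The record structure and field names below are hypothetical, not the benchmark's actual schema; they only illustrate how side-query probe outcomes would be tallied independently of task outcomes:

```python
def score_profile(records):
    """Score one agent profile along two separate axes.

    records: list of dicts, each with
      'task_ok'  -- bool, whether the deterministic task was solved
      'probe_ok' -- list of bools, one per side-query probe, whether
                    an earlier commitment was still honored

    Returns (task_success_rate, commitment_integrity_rate).
    """
    tasks = [r["task_ok"] for r in records]
    probes = [ok for r in records for ok in r["probe_ok"]]
    return sum(tasks) / len(tasks), sum(probes) / len(probes)


# A profile can solve tasks while dropping commitments, or vice versa:
records = [
    {"task_ok": True, "probe_ok": [True, False]},   # solved, one probe failed
    {"task_ok": False, "probe_ok": [True, True]},   # failed, commitments kept
]
print(score_profile(records))  # → (0.5, 0.75)
```

Because the two rates are computed from disjoint signals, a high task-success profile can still score low on integrity, which is the decoupling the benchmark is built to expose.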

Key facts

  • NeuroState-Bench evaluates commitment integrity in LLM agent profiles.
  • Uses benchmark-defined side-query probes rather than hidden activations.
  • Contains 144 deterministic tasks and 306 side-query probes.
  • Covers eight cognitively motivated failure families.
  • Includes clean and distractor variants across three difficulty bands.
  • Main evaluation involves 32 profiles: 16 local and 16 hosted large-model.
  • Human calibration on 104 task units achieved weighted kappa = 0.977.
  • Task success and commitment integrity are distinct performance dimensions.
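The calibration statistic cited above, weighted kappa, can be computed from two annotators' ordinal labels. This is a generic quadratic-weighted Cohen's kappa, not the benchmark's published scoring code, and the example labels are invented:

```python
def quadratic_weighted_kappa(a, b, k):
    """Cohen's kappa with quadratic weights for two raters' labels in 0..k-1.

    Penalizes disagreements by the squared distance between ordinal
    categories, so near-misses count less than large discrepancies.
    """
    n = len(a)
    # Observed confusion matrix between the two raters.
    obs = [[0.0] * k for _ in range(k)]
    for x, y in zip(a, b):
        obs[x][y] += 1
    row = [sum(obs[i]) for i in range(k)]
    col = [sum(obs[i][j] for i in range(k)) for j in range(k)]
    num = den = 0.0
    for i in range(k):
        for j in range(k):
            w = (i - j) ** 2 / (k - 1) ** 2       # quadratic disagreement weight
            num += w * obs[i][j]                   # observed weighted disagreement
            den += w * row[i] * col[j] / n         # chance-expected disagreement
    return 1.0 - num / den


# One near-miss out of six items yields kappa = 8/9 ~= 0.889.
a = [0, 1, 2, 1, 0, 2]
b = [0, 1, 2, 2, 0, 2]
print(round(quadratic_weighted_kappa(a, b, 3), 3))  # → 0.889
```

Values near 1, such as the reported 0.977, indicate near-perfect agreement between annotators after correcting for chance.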
