NeuroState-Bench: Benchmarking Commitment Integrity in LLM Agents
NeuroState-Bench is a human-calibrated benchmark designed to evaluate commitment integrity in LLM agent profiles. It uses benchmark-defined side-query probes, rather than inferred hidden activations, to assess whether an agent preserves its commitments across multi-turn tasks. The benchmark comprises 144 deterministic tasks and 306 side-query probes covering eight cognitively motivated failure families, with clean and distractor variants across three difficulty bands. The main evaluation covers 32 profiles: 16 local and 16 hosted large-model profiles. Human calibration on 104 sampled task units produced 216 raw annotations and 108 adjudicated task rows, achieving weighted kappa = 0.977 and ICC(2,1) = 0.977. The central finding is that task success and commitment integrity are distinct dimensions of agent performance.
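To illustrate the probe-based approach, the sketch below shows one way a side-query probe could check whether an agent still honors an earlier commitment mid-task. All names here (`probe_commitment`, the probe fields, the toy agent) are hypothetical illustrations, not NeuroState-Bench's actual API.

```python
# Hypothetical sketch of a side-query probe check. The agent is modeled
# as any callable mapping a question string to a reply string; the probe
# schema ("question", "expected_token") is an illustrative assumption.

def probe_commitment(agent_ask, probe):
    """Ask a side query mid-task and check whether the agent's reply
    still reflects an earlier commitment (e.g. 'always answer in JSON')."""
    reply = agent_ask(probe["question"])
    return probe["expected_token"] in reply

# Toy agent that keeps its commitment to prefix every answer with 'JSON:'
toy_agent = lambda q: 'JSON: {"answer": "' + q.upper() + '"}'

probe = {"question": "what format are you using?", "expected_token": "JSON:"}
print(probe_commitment(toy_agent, probe))  # → True
```

A real harness would run many such probes per task, interleaved with the task turns, and score the fraction of probes on which the commitment survives.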
Key facts
- NeuroState-Bench evaluates commitment integrity in LLM agent profiles.
- Uses benchmark-defined side-query probes rather than hidden activations.
- Contains 144 deterministic tasks and 306 side-query probes.
- Covers eight cognitively motivated failure families.
- Includes clean and distractor variants across three difficulty bands.
- Main evaluation involves 32 profiles: 16 local and 16 hosted large-model.
- Human calibration on 104 task units achieved weighted kappa = 0.977.
- Task success and commitment integrity are distinct performance dimensions.
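The calibration statistic cited above can be computed without any dependencies; a minimal sketch follows, assuming quadratic weighting (the benchmark's exact weighting scheme is an assumption here):

```python
# Cohen's kappa with quadratic weights for two raters' labels drawn
# from categories 0..k-1. Weighting scheme (quadratic) is an assumption.

def quadratic_weighted_kappa(a, b, k):
    n = len(a)
    # Observed agreement matrix, normalized to proportions.
    obs = [[0.0] * k for _ in range(k)]
    for x, y in zip(a, b):
        obs[x][y] += 1.0 / n
    # Marginal label frequencies for each rater (chance-agreement model).
    pa = [a.count(c) / n for c in range(k)]
    pb = [b.count(c) / n for c in range(k)]
    num = den = 0.0
    for i in range(k):
        for j in range(k):
            w = (i - j) ** 2 / (k - 1) ** 2  # quadratic disagreement weight
            num += w * obs[i][j]
            den += w * pa[i] * pb[j]
    return 1.0 - num / den

# Perfect agreement between raters yields kappa = 1.0.
print(quadratic_weighted_kappa([0, 1, 2, 1], [0, 1, 2, 1], 3))  # → 1.0
```

This matches the behavior of scikit-learn's `cohen_kappa_score(..., weights="quadratic")` while staying dependency-free.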