ARTFEED — Contemporary Art Intelligence

HiL-Bench: Benchmarking AI Agents' Judgment to Ask for Help

ai-technology · 2026-05-01

A new benchmark, HiL-Bench (Human-in-the-Loop Benchmark), addresses a critical failure mode in frontier coding agents: their inability to recognize when to ask for help. While these agents excel at complex tasks with complete context, they collapse under ambiguous or incomplete specifications. Current benchmarks reward execution correctness only, ignoring the judgment gap. HiL-Bench introduces tasks with human-validated blockers—missing information, ambiguous requests, or contradictions—that emerge only through progressive exploration. Its core metric, Ask-F1, balances question precision and blocker recall to penalize both over-asking and silent guessing. The benchmark aims to measure selective escalation skill, a key bottleneck in autonomous AI.
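
To make Ask-F1 concrete, the sketch below computes the metric as described: the harmonic mean of question precision (the fraction of an agent's questions that target a real, human-validated blocker) and blocker recall (the fraction of a task's blockers the agent surfaces by asking). The function name, argument names, and the example counts are illustrative assumptions, not the benchmark's reference implementation.

    # Hypothetical sketch of Ask-F1: harmonic mean of question precision and
    # blocker recall, per the description above. Names and the matching rule
    # (how a question is credited to a blocker) are assumptions.
    def ask_f1(questions_asked: int, questions_on_blockers: int,
               blockers_total: int, blockers_surfaced: int) -> float:
        precision = questions_on_blockers / questions_asked if questions_asked else 0.0
        recall = blockers_surfaced / blockers_total if blockers_total else 0.0
        if precision + recall == 0:
            return 0.0
        return 2 * precision * recall / (precision + recall)

    # Illustrative run: 4 questions asked, 3 hit real blockers (precision 0.75),
    # 3 of 5 validated blockers surfaced (recall 0.60) -> Ask-F1 of about 0.67.
    print(ask_f1(4, 3, 5, 3))

Under this scoring, an agent that never asks earns zero recall (silent guessing), while one that peppers the user with off-target questions drives precision down, so both failure modes lower the score.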

Key facts

  • HiL-Bench stands for Human-in-the-Loop Benchmark.
  • It measures AI agents' ability to ask for help when faced with ambiguous or incomplete specifications.
  • Current benchmarks are blind to this failure mode, rewarding only execution correctness.
  • Each task contains human-validated blockers that surface through progressive exploration.
  • The core metric is Ask-F1, the harmonic mean of question precision and blocker recall.
  • Frontier coding agents collapse when specifications are incomplete or ambiguous.
  • The bottleneck is judgment, not raw capability.
  • The benchmark was introduced on arXiv with identifier 2604.09408.

Entities

Institutions

  • arXiv

Sources