DecisionBench: Benchmarking Emergent Delegation in AI Agent Workflows

ai-technology · 2026-05-20

Researchers have introduced DecisionBench, a benchmark for evaluating emergent delegation in long-horizon agentic workflows. Its task suite draws on GAIA, tau-bench, and BFCL multi-turn, and its peer-model pool contains 11 models from seven vendor families. Agents delegate through a call_model interface with an optional read_profile channel. The benchmark also includes a skill-annotation layer and a metric suite covering quality, cost, latency, delegation rate, routing fidelity-at-k, vendor self-preference, and a counterfactual-delegation ceiling. In a five-condition reference sweep over 23,375 task instances, mean end-task quality was statistically indistinguishable across four awareness conditions, suggesting that the delegation setups tested did not improve end-task performance.
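To make the delegation interface concrete, here is a minimal Python sketch of what a call_model plus optional read_profile channel could look like. Only the two names call_model and read_profile come from the summary above. The signatures, the PeerProfile fields, the DelegationInterface class, and the peer and vendor names are hypothetical illustrations, not the benchmark's actual API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class PeerProfile:
    """Hypothetical profile record; the real schema is not given in the source."""
    model_id: str
    vendor: str
    skills: List[str] = field(default_factory=list)


@dataclass
class DelegationInterface:
    """Toy stand-in for a call_model / read_profile delegation channel (assumed design)."""
    peers: Dict[str, Callable[[str], str]]             # model_id -> callable peer
    profiles: Optional[Dict[str, PeerProfile]] = None  # None means read_profile is disabled
    call_log: List[str] = field(default_factory=list)  # records which peers were called

    def call_model(self, model_id: str, prompt: str) -> str:
        """Forward a prompt to a peer model and log the delegation."""
        if model_id not in self.peers:
            raise KeyError(f"unknown peer: {model_id}")
        self.call_log.append(model_id)
        return self.peers[model_id](prompt)

    def read_profile(self, model_id: str) -> Optional[PeerProfile]:
        """Return a peer's profile, or None when the optional channel is off."""
        if self.profiles is None:
            return None
        return self.profiles.get(model_id)


# Usage with a stub peer (names are placeholders, not real pool members).
iface = DelegationInterface(
    peers={"peer-a": lambda p: f"[peer-a] {p}"},
    profiles={"peer-a": PeerProfile("peer-a", "vendor-x", ["tool-use"])},
)
reply = iface.call_model("peer-a", "Summarize the ticket")
```

Keeping read_profile optional in the sketch mirrors the summary's description: the same agent loop can run with or without access to peer information, which is one plausible way to vary awareness between conditions.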

Key facts

DecisionBench is a benchmark for emergent delegation in long-horizon agentic workflows.
The task suite includes GAIA, tau-bench, and BFCL multi-turn.
The peer-model pool consists of 11 models from 7 vendor families.
The delegation interface includes call_model and an optional read_profile channel.
Metric suite covers quality, cost, latency, delegation rate, routing fidelity-at-k, vendor self-preference, and counterfactual-delegation ceiling.
The substrate is agnostic to how peer information is generated or delivered.
A five-condition reference sweep was conducted on 23,375 task instances.
Mean end-task quality is statistically indistinguishable across four awareness conditions.
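
Two of the listed metrics, delegation rate and vendor self-preference, can be illustrated with a short Python sketch. The definitions here are assumptions, since the summary does not give formulas: delegation rate is taken as the fraction of task runs with at least one delegated call, and vendor self-preference as the share of delegated calls that go to a peer from the caller's own vendor. The record types and the sample data are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Delegation:
    caller_vendor: str  # vendor of the delegating agent (hypothetical field)
    callee_vendor: str  # vendor of the peer that received the call


@dataclass
class TaskRun:
    delegations: List[Delegation]  # empty list means the agent never delegated


def delegation_rate(runs: List[TaskRun]) -> float:
    """Assumed definition: share of runs with at least one delegated call."""
    if not runs:
        return 0.0
    return sum(1 for r in runs if r.delegations) / len(runs)


def vendor_self_preference(runs: List[TaskRun]) -> Optional[float]:
    """Assumed definition: share of delegated calls sent to a same-vendor peer."""
    calls = [d for r in runs for d in r.delegations]
    if not calls:
        return None  # undefined when nothing was delegated
    same = sum(1 for d in calls if d.caller_vendor == d.callee_vendor)
    return same / len(calls)


# Made-up sample: four runs, two of which delegate, three calls in total.
runs = [
    TaskRun([Delegation("A", "A"), Delegation("A", "B")]),
    TaskRun([]),
    TaskRun([Delegation("A", "A")]),
    TaskRun([]),
]
rate = delegation_rate(runs)          # 2 of 4 runs delegated
self_pref = vendor_self_preference(runs)  # 2 of 3 calls stayed in-vendor
```

A raw same-vendor share like this does not correct for how many same-vendor peers are available in the pool, so the benchmark's own metric may normalize differently.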

Sources

arXiv cs.AI — 2026-05-20