AgentProp-Bench Study Reveals LLM Tool-Use Evaluation Flaws and Mitigation Strategies
A new benchmark, AgentProp-Bench, challenges the assumed reliability of automated evaluation for tool-using large language model agents. The study introduces 2,000 tasks with 2,300 traces across four domains, tests nine production LLMs, and includes a 100-label subset validated by human annotators. Quantifying judge reliability, the researchers found that substring-based judging achieved only chance-level agreement with human annotation (kappa = 0.049), while a three-LLM ensemble improved to moderate agreement (kappa = 0.432) but exhibited a conservative bias. Under validated evaluation conditions, parameter-level injections propagated to incorrect final answers with a human-calibrated probability of approximately 0.62, ranging from 0.46 to 0.73 across models. The study also found that rejection (catching bad parameters) and recovery (correcting after a bad parameter is accepted) are independent model capabilities (Spearman rho = 0.126, p = 0.747), and it develops a tuned runtime interceptor to reduce hallucinations in tool-using agents. The research was published on arXiv under identifier 2604.16706v1 and adds to the understanding of evaluation methodology for AI systems that use external tools.
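To make the agreement figures concrete, the sketch below shows one common way to score an automated judge against human labels with Cohen's kappa, the chance-corrected statistic quoted above. This is a minimal illustration, not the paper's evaluation code: the helper names (`cohens_kappa`, `substring_judge`) and the binary correct/incorrect labeling are assumptions for the example.

```python
from collections import Counter
from typing import Sequence

def cohens_kappa(judge: Sequence[int], human: Sequence[int]) -> float:
    """Chance-corrected agreement between two label sequences.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    rate and p_e is the agreement expected from each rater's marginal
    label frequencies alone.
    """
    assert judge and len(judge) == len(human)
    n = len(judge)
    p_o = sum(j == h for j, h in zip(judge, human)) / n
    jc, hc = Counter(judge), Counter(human)
    p_e = sum((jc[k] / n) * (hc[k] / n) for k in set(jc) | set(hc))
    return 1.0 if p_e == 1 else (p_o - p_e) / (1 - p_e)

def substring_judge(final_answer: str, reference: str) -> int:
    """Naive substring-based verdict: 1 if the reference string appears
    verbatim (case-insensitively) in the agent's final answer, else 0."""
    return int(reference.lower() in final_answer.lower())

# Example: score a substring judge against human labels over a set of traces.
traces = [("The capital is Paris.", "Paris"), ("I could not find it.", "Lyon")]
human_labels = [1, 0]
judge_labels = [substring_judge(ans, ref) for ans, ref in traces]
print(cohens_kappa(judge_labels, human_labels))
```

A kappa near 0, as reported for substring judging, means the judge's verdicts agree with humans no more often than random labeling with the same marginal rates would.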
Key facts
- AgentProp-Bench contains 2,000 tasks with 2,300 traces across four domains
- The benchmark includes a 100-label human-validated subset
- Nine production LLMs were evaluated in the study
- Substring-based judging achieved only chance-level agreement with human annotation (kappa=0.049)
- A three-LLM ensemble reached moderate agreement (kappa=0.432) with conservative bias
- Parameter-level injections propagate to wrong final answers with probability approximately 0.62
- Rejection and recovery capabilities are independent model functions (Spearman rho=0.126, p=0.747)
- A tuned runtime interceptor was developed to reduce hallucinations in tool-using agents (a sketch of the general approach follows this list)
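The paper's interceptor details are not reproduced here; the following is a minimal sketch of the general idea a runtime interceptor implements: validating a proposed tool call's parameters before the call executes, so a bad value is rejected rather than propagated into the final answer. The tool name `get_weather`, the `TOOL_SCHEMAS` table, and `intercept_tool_call` are hypothetical names invented for this example.

```python
from typing import Any, Callable, Dict

# Hypothetical per-tool parameter schemas: tool name -> {parameter: validator}.
TOOL_SCHEMAS: Dict[str, Dict[str, Callable[[Any], bool]]] = {
    "get_weather": {
        "city": lambda v: isinstance(v, str) and 0 < len(v) <= 80,
        "unit": lambda v: v in {"celsius", "fahrenheit"},
    },
}

class ParameterRejected(ValueError):
    """Raised when a proposed tool call fails validation."""

def intercept_tool_call(name: str, args: Dict[str, Any]) -> Dict[str, Any]:
    """Check a proposed tool call before it reaches the real tool.

    Unknown tools and arguments that fail their validator are rejected,
    which forces the agent to re-plan instead of acting on a bad value.
    """
    schema = TOOL_SCHEMAS.get(name)
    if schema is None:
        raise ParameterRejected(f"unknown tool: {name}")
    for param, validator in schema.items():
        if param not in args or not validator(args[param]):
            raise ParameterRejected(f"bad or missing parameter: {param}")
    return args  # safe to forward to the actual tool

# Example: a suspicious unit value is caught before execution.
try:
    intercept_tool_call("get_weather", {"city": "Oslo", "unit": "kelvin"})
except ParameterRejected as err:
    print("rejected:", err)
```

In an agent loop, the rejection would typically be surfaced back to the model as a tool error so it can retry with corrected arguments. Note that this kind of check targets rejection (catching bad parameters); per the benchmark's finding, recovery after a bad parameter has already been accepted is a separate capability and would need a different mechanism.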
Entities
Institutions
- arXiv