Proxy State-Based Evaluation for Multi-Turn LLM Agents
A new benchmark for multi-turn tool-calling LLM agents uses proxy state-based evaluation to avoid costly deterministic backends. The framework, proposed in arXiv:2602.16246, employs an LLM state tracker to infer structured proxy states from interaction traces, with LLM judges verifying goal completion and detecting hallucinations. It aims to produce stable, model-differentiating rankings.
Key facts
- arXiv:2602.16246v3
- Proxy State-Based Evaluation is an LLM-driven simulation framework
- Preserves final state-based evaluation without a deterministic database
- Scenario specifies user goal, user/system facts, expected final state, and expected agent behavior
- LLM state tracker infers structured proxy state from full interaction trace
- LLM judges verify goal completion and detect tool/user hallucinations
- Prior benchmarks: tau-bench, tau^2-bench, AppWorld rely on fully deterministic backends
- Empirically produces stable, model-differentiating rankings
Entities
—