Proxy State-Based Evaluation for Multi-Turn LLM Agents

other · 2026-05-14

A new benchmark for multi-turn tool-calling LLM agents uses proxy state-based evaluation to avoid the cost of fully deterministic backends. The framework, proposed in arXiv:2602.16246, uses an LLM state tracker to infer a structured proxy state from the interaction trace, and LLM judges then verify goal completion and detect tool/user hallucinations. It is reported to produce stable, model-differentiating rankings.

Key facts

arXiv:2602.16246v3
Proxy State-Based Evaluation is an LLM-driven simulation framework
Preserves final state-based evaluation without a deterministic database
Scenario specifies user goal, user/system facts, expected final state, and expected agent behavior
LLM state tracker infers structured proxy state from full interaction trace
LLM judges verify goal completion and detect tool/user hallucinations
Prior benchmarks (tau-bench, tau^2-bench, AppWorld) rely on fully deterministic backends
Empirically produces stable, model-differentiating rankings
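
The pipeline listed above (scenario, state tracker, judges) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the names Scenario, Verdict, and evaluate, the (role, content) trace format, and the "key=value" turn convention are all assumptions made here, and the three toy_* functions are deterministic stand-ins for what would be LLM calls in the actual framework.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Scenario:
    # Fields mirror the scenario description above; the names are hypothetical.
    user_goal: str
    user_facts: dict
    system_facts: dict
    expected_final_state: dict
    expected_agent_behavior: str


@dataclass
class Verdict:
    goal_completed: bool
    hallucinations: list = field(default_factory=list)


def evaluate(
    scenario: Scenario,
    trace: list,  # assumed format: list of (role, content) tuples
    track_state: Callable[[Scenario, list], dict],
    judge_goal: Callable[[Scenario, dict], bool],
    judge_hallucinations: Callable[[Scenario, list], list],
) -> Verdict:
    # 1. Infer a structured proxy state from the full interaction trace.
    proxy_state = track_state(scenario, trace)
    # 2. Judge goal completion against the expected final state.
    completed = judge_goal(scenario, proxy_state)
    # 3. Separately flag hallucinated content in the trace.
    flagged = judge_hallucinations(scenario, trace)
    return Verdict(goal_completed=completed, hallucinations=flagged)


# Toy stand-ins for the LLM components (assumptions, for illustration only).
def toy_track_state(scenario: Scenario, trace: list) -> dict:
    # Treat every "key=value" tool turn as a write to the proxy state.
    state = {}
    for role, content in trace:
        if role == "tool" and "=" in content:
            key, _, value = content.partition("=")
            state[key] = value
    return state


def toy_judge_goal(scenario: Scenario, proxy_state: dict) -> bool:
    return proxy_state == scenario.expected_final_state


def toy_judge_hallucinations(scenario: Scenario, trace: list) -> list:
    # Flag agent claims of the form "key=value" that no known fact supports.
    known = {**scenario.system_facts, **scenario.user_facts}
    flagged = []
    for role, content in trace:
        if role == "agent" and "=" in content:
            key, _, value = content.partition("=")
            if known.get(key) != value:
                flagged.append(content)
    return flagged


scenario = Scenario(
    user_goal="Change seat to 12A",
    user_facts={"name": "Ada"},
    system_facts={"seat": "14C"},
    expected_final_state={"seat": "12A"},
    expected_agent_behavior="Confirm the current seat before changing it",
)
trace = [
    ("user", "Please move me to 12A"),
    ("agent", "seat=14C"),  # agent restates a fact the system supports
    ("tool", "seat=12A"),   # tool call that changes the state
    ("agent", "Done"),
]
verdict = evaluate(
    scenario, trace, toy_track_state, toy_judge_goal, toy_judge_hallucinations
)
```

Passing the tracker and judges in as callables keeps the evaluation loop independent of any particular model or prompt, so the toy functions could be swapped for LLM-backed ones without changing evaluate.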

Entities

—

Sources

arXiv cs.AI — 2026-05-14