ARTFEED — Contemporary Art Intelligence

Proxy State-Based Evaluation for Multi-Turn LLM Agents

other · 2026-05-14

A new benchmark for multi-turn tool-calling LLM agents uses proxy state-based evaluation to avoid costly deterministic backends. The framework, proposed in arXiv:2602.16246, employs an LLM state tracker to infer structured proxy states from interaction traces, with LLM judges verifying goal completion and detecting hallucinations. It aims to produce stable, model-differentiating rankings.

Key facts

  • arXiv:2602.16246v3
  • Proxy State-Based Evaluation is an LLM-driven simulation framework
  • Preserves final state-based evaluation without a deterministic database
  • Scenario specifies user goal, user/system facts, expected final state, and expected agent behavior
  • LLM state tracker infers structured proxy state from full interaction trace
  • LLM judges verify goal completion and detect tool/user hallucinations
  • Prior benchmarks: tau-bench, tau^2-bench, AppWorld rely on fully deterministic backends
  • Empirically produces stable, model-differentiating rankings

Entities

Sources