Interactive AI Evaluation Needs a Design Science
A new position paper argues that evaluating large language models (LLMs) deployed as interactive systems requires a fundamental shift from static benchmarks to a principled evaluation paradigm. The paper, posted on arXiv, notes that current interactive benchmarks are fragmented, differing in artifacts, scoring methods, and claims. It defines evaluation as an autonomous mapping from evidence to judgments and shows that interactive evaluation changes both sides of this mapping. The authors call for a design science approach to build robust evaluation frameworks for LLMs acting over time through tools, environments, users, and other agents.
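The idea of evaluation as a mapping from evidence to judgments can be made concrete with a small sketch. The summary above does not reproduce the paper's formalism, so everything below is an illustrative assumption: the names `Step`, `Judgment`, `judge_response`, and `judge_trajectory` are hypothetical, and the scoring rules are placeholders. The sketch only aims to show how both sides of the mapping change: the evidence goes from a single response to a trajectory of actions and observations, and the judgment goes from "the answer matches" to a claim about behavior over time.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """One turn of an interaction (hypothetical structure)."""
    action: str       # what the system did, e.g. a tool call or a message
    observation: str  # what the tool, environment, or user returned


@dataclass
class Judgment:
    score: float
    claim: str  # what the score is taken to support


def judge_response(response: str, reference: str) -> Judgment:
    """Response-centered evaluation: the evidence is a single output."""
    score = 1.0 if response.strip() == reference.strip() else 0.0
    return Judgment(score, "answer matches reference")


def judge_trajectory(
    steps: List[Step],
    goal_reached: Callable[[List[Step]], bool],
    max_steps: int,
) -> Judgment:
    """Interactive evaluation: the evidence is the whole trajectory."""
    success = goal_reached(steps)
    within_budget = len(steps) <= max_steps
    score = 1.0 if success and within_budget else 0.0
    return Judgment(score, "goal reached within step budget")


# Example usage with made-up data.
trajectory = [
    Step("search('weather Paris')", "18C, cloudy"),
    Step("reply('It is 18C and cloudy')", "user: thanks"),
]


def replied(steps: List[Step]) -> bool:
    # Placeholder success criterion: the system sent at least one reply.
    return any(step.action.startswith("reply") for step in steps)


static_result = judge_response("42", " 42\n")
interactive_result = judge_trajectory(trajectory, replied, max_steps=5)
```

In this sketch the same trajectory can receive a different judgment under a tighter step budget, which is one way differing scoring methods could produce the fragmentation the paper describes.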
Key facts
- arXiv paper 2605.17829v1
- Abstract discusses structural change in AI evaluation
- LLMs are deployed as systems acting over time through tools, environments, users, and other agents
- Current evaluation practices inherit assumptions from response-centered benchmarks
- Interactive benchmarks are fragmented
- Paper argues for a principled evaluation paradigm
- Defines evaluation as an autonomous mapping from evidence to judgments
Entities
Institutions
- arXiv