Interactive AI Evaluation Needs a Design Science

ai-technology · 2026-05-20

A new position paper argues that evaluating large language models (LLMs) deployed as interactive systems requires a fundamental shift from static benchmarks to a principled evaluation paradigm. The paper, posted on arXiv, notes that current interactive benchmarks are fragmented, differing in artifacts, scoring methods, and claims. It defines evaluation as an autonomous mapping from evidence to judgments and shows that interactive evaluation changes both sides of this mapping. The authors call for a design science approach to build robust evaluation frameworks for LLMs acting over time through tools, environments, users, and other agents.
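
As a rough illustration of the "mapping from evidence to judgments" framing, the Python sketch below contrasts a response-centered evaluator with an interactive one. Every name in it (Step, Trajectory, judge_response, judge_trajectory) and every scoring rule is hypothetical and does not come from the paper. The sketch only shows how both sides of the mapping can change: the evidence goes from a single response to a multi-step trajectory, and the judgment goes from one score to several.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One step of an interaction (hypothetical structure, not from the paper)."""
    action: str        # e.g. a tool call or a message to the user
    observation: str   # what the environment or user returned
    ok: bool           # whether the step succeeded


@dataclass
class Trajectory:
    """Evidence for interactive evaluation: a sequence of steps plus an outcome."""
    steps: list[Step] = field(default_factory=list)
    goal_reached: bool = False


def judge_response(response: str, reference: str) -> float:
    """Static evaluation: the evidence is one response, the judgment is one score."""
    return 1.0 if response.strip() == reference.strip() else 0.0


def judge_trajectory(traj: Trajectory) -> dict[str, float]:
    """Interactive evaluation: the evidence is a trajectory, the judgment has parts."""
    n = len(traj.steps)
    failed = sum(1 for s in traj.steps if not s.ok)
    return {
        "success": 1.0 if traj.goal_reached else 0.0,
        "steps": float(n),
        "failure_rate": failed / n if n else 0.0,
    }


# Illustrative data only: a made-up three-step booking interaction.
traj = Trajectory(
    steps=[
        Step("search('flights')", "3 results", ok=True),
        Step("book(id=7)", "error: sold out", ok=False),
        Step("book(id=8)", "confirmed", ok=True),
    ],
    goal_reached=True,
)
static_score = judge_response(" 42 ", "42")
interactive_judgment = judge_trajectory(traj)
```

In this toy setup the static evaluator sees only a final string, while the interactive evaluator can also report how many steps were taken and how many failed along the way, even when the goal was eventually reached.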

Key facts

arXiv paper 2605.17829v1
Abstract describes a structural change in AI evaluation
LLMs are increasingly deployed as systems acting over time
Current evaluation practices inherit assumptions from response-centered benchmarks
Interactive benchmarks are fragmented
Paper argues for a principled evaluation paradigm
Defines evaluation as an autonomous mapping from evidence to judgments
