ARTFEED — Contemporary Art Intelligence

Replication-First Paradigm for LLM Behavioral Benchmarking

ai-technology · 2026-05-28

A new arXiv paper proposes a replication-first paradigm for evaluating LLM behavior, aimed at the limitations of subjective human ratings and LLM-as-judge approaches. The method certifies instruments on four properties: reliability across runs, cross-instrument replication, historical calibration, and pre-registered prediction. In a test on emotional accompaniment, the paper's self-evolving rubric stabilizes at a 9-dimension structure.

Key facts

  • Human inter-rater agreement on subjective LLM qualities saturates at ρ ≈ 0.45
  • An LLM-as-judge proxy risks circularity if the judge shares the target model's training cohort
  • Proposed paradigm uses four orthogonal properties: reliability, cross-instrument replication, historical calibration, pre-registered prediction
  • Tested on emotional accompaniment with a data-driven self-evolving rubric
  • Procedure stabilizes to a 9-dimension structure
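
The agreement figure and the "reliability across runs" property above both rest on a rank correlation between two sets of scores. As a rough illustration only, the sketch below computes Spearman's rho in plain Python for two hypothetical scoring runs over the same five responses. The function names, the sample scores, and the choice of Spearman's rho as the reliability statistic are assumptions made for this example, not details taken from the paper.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the whole run of tied values starting at i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one sequence is constant")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical scores from two runs of the same instrument on five responses.
run_a = [3, 1, 4, 2, 5]
run_b = [2, 1, 5, 3, 4]
rho = spearman_rho(run_a, run_b)
print(f"rho = {rho:.2f}")
```

The same function could be applied to two human raters or to two different instruments scoring the same items; which pairs the paper actually compares is not stated in this summary.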
