Replication-First Paradigm for LLM Behavioral Benchmarking

ai-technology · 2026-05-28

A new arXiv paper proposes a replication-first paradigm for evaluating LLM behavior, aimed at the limitations of subjective human ratings and LLM-as-judge approaches. The method certifies an evaluation instrument through four properties: reliability across runs, cross-instrument replication, historical calibration, and pre-registered prediction. In a test on emotional accompaniment, the paper's data-driven, self-evolving rubric stabilizes at a 9-dimension structure.

Key facts

Human inter-rater agreement on subjective LLM qualities saturates at ρ ≈ 0.45
LLM-as-judge proxy risks circularity if judge shares target's training cohort
Proposed paradigm uses four orthogonal properties: reliability, cross-instrument replication, historical calibration, pre-registered prediction
Tested on emotional accompaniment with a data-driven self-evolving rubric
Rubric-evolution procedure stabilizes at a 9-dimension structure
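
Illustration (not from the paper)

The first property, reliability across runs, and the rho figure quoted for human raters both come down to an agreement statistic between two sets of scores. The brief does not say which statistic the paper uses, so the sketch below assumes Spearman rank correlation. The function names and the sample scores are hypothetical and exist only to show how such a check could be computed.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the full run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one sequence is constant")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical scores for the same five responses from two runs of one instrument.
run_a = [3.0, 4.5, 2.0, 5.0, 1.5]
run_b = [2.5, 4.0, 2.0, 5.0, 1.0]
rho = spearman_rho(run_a, run_b)
```

The same function could compare two human raters, or two different instruments for the cross-instrument replication check. What threshold counts as reliable is a decision for the paper, not something this sketch settles.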

Entities

—

Sources

arXiv cs.AI — 2026-05-28