Replication-First Paradigm for LLM Behavioral Benchmarking
A new arXiv paper proposes a replication-first paradigm for evaluating LLM behavior, aimed at the limitations of subjective human ratings and LLM-as-judge approaches. The method certifies measurement instruments by four properties: reliability across runs, cross-instrument replication, historical calibration, and pre-registered prediction. In a test on emotional accompaniment, a data-driven, self-evolving rubric stabilizes at a 9-dimension structure.
Key facts
- Human inter-rater agreement on subjective LLM qualities saturates near ρ ≈ 0.45
- LLM-as-judge proxy risks circularity if judge shares target's training cohort
- Proposed paradigm uses four orthogonal properties: reliability, cross-instrument replication, historical calibration, pre-registered prediction
- Tested on emotional accompaniment with a data-driven self-evolving rubric
- Procedure stabilizes to a 9-dimension structure
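The first of the four properties, reliability across runs, can be illustrated with a rank correlation between two scoring runs of the same instrument over the same items. The following is a minimal sketch, not the paper's procedure: this summary does not say which statistic the authors use, so Spearman's rho is an assumption (chosen because the human-agreement figure above is reported as a rho), and the function names and scores are hypothetical.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one sequence is constant")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical scores: the same five responses scored in two independent runs.
run_a = [3.0, 4.5, 2.0, 5.0, 1.5]
run_b = [3.2, 4.4, 2.5, 4.9, 1.0]
rho = spearman_rho(run_a, run_b)
print(f"run-to-run rho = {rho:.2f}")
```

Here the two runs order the five responses identically, so rho comes out at 1.0 even though the raw scores differ. An instrument would fail this kind of check when reruns reshuffle the ordering, which is a separate question from whether the scores are valid.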