Multi-Level Bootstrapping Improves LLM Evaluation Reproducibility
A new arXiv preprint introduces a multi-level bootstrapping approach to modeling annotator behavior in large language model (LLM) evaluations, addressing reproducibility concerns in AI research. By tracking persistent rater identifiers, the method captures individual annotator variance that standard practice, which typically collects 3-5 annotations per item without recording who produced them, cannot. It also simulates how experimental repeatability improves as the annotator pool grows, yielding a framework for more reliable assessments of LLM safety and trustworthiness.
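To make the idea concrete, here is a minimal sketch of a two-level (hierarchical) bootstrap in Python, assuming each persistent rater ID maps to that rater's numeric item scores. The function name, data layout, and rating scale are illustrative assumptions; the paper's exact resampling scheme may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def multilevel_bootstrap(scores, n_boot=1000):
    """Two-level bootstrap of the mean score.

    Level 1 resamples annotators (by persistent rater ID) with
    replacement; level 2 resamples each chosen annotator's own
    ratings, so per-rater variance propagates into the estimate.
    """
    rater_ids = list(scores)
    estimates = np.empty(n_boot)
    for b in range(n_boot):
        # Level 1: draw a bootstrap sample of rater IDs.
        sampled = rng.choice(rater_ids, size=len(rater_ids), replace=True)
        # Level 2: resample each chosen rater's own ratings.
        vals = np.concatenate([
            rng.choice(scores[r], size=len(scores[r]), replace=True)
            for r in sampled
        ])
        estimates[b] = vals.mean()
    return estimates

# Hypothetical data: 20 raters, each scoring 50 items on a 1-5 scale.
scores = {f"rater_{i}": rng.integers(1, 6, size=50) for i in range(20)}
print(np.percentile(multilevel_bootstrap(scores), [2.5, 97.5]))
```

Sampling at both levels separates disagreement between raters from noise within a single rater's judgments, which is exactly the information lost when annotations are collected without persistent IDs.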
Key facts
- arXiv paper 2605.13801 proposes multi-level bootstrapping for annotator modeling.
- Standard evaluation uses 3-5 annotations per item without persistent rater IDs.
- The method models individual annotator variance across items.
- Aims to improve reproducibility in LLM evaluations.
- Addresses the reproducibility crisis in AI research.
- Focuses on safety, robustness, and trustworthiness of generative AI.
- Human raters introduce divergent biases and subjective opinions.
- Little data exists on how repeatability improves with larger annotator pools; the sketch after this list illustrates one way to probe that.
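Since the central claim is that repeatability improves as the annotator pool grows, a natural companion sketch simulates that effect directly: draw pools of k raters, bootstrap the mean score for each pool size, and compare interval widths. This continues the hypothetical setup above; `repeatability_vs_pool_size` and the chosen pool sizes are assumptions for illustration, not the authors' procedure.

```python
import numpy as np

rng = np.random.default_rng(1)

def repeatability_vs_pool_size(scores, pool_sizes=(3, 5, 10, 25), n_boot=500):
    """For each simulated pool size, bootstrap the mean score using that
    many raters drawn with replacement and report the 95% interval
    width; narrower intervals indicate a more repeatable evaluation."""
    rater_ids = list(scores)
    widths = {}
    for k in pool_sizes:
        est = np.empty(n_boot)
        for b in range(n_boot):
            # Simulate an evaluation run with a pool of k raters.
            pool = rng.choice(rater_ids, size=k, replace=True)
            est[b] = np.concatenate([scores[r] for r in pool]).mean()
        lo, hi = np.percentile(est, [2.5, 97.5])
        widths[k] = hi - lo
    return widths

# Hypothetical data as before: 20 raters, each scoring 50 items on 1-5.
scores = {f"rater_{i}": rng.integers(1, 6, size=50) for i in range(20)}
print(repeatability_vs_pool_size(scores))
```

Under simple assumptions (raters drawn i.i.d.), the interval width should shrink roughly like 1/sqrt(k), which is the kind of repeatability improvement the key facts above allude to.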
Entities
Institutions
- arXiv