ARTFEED — Contemporary Art Intelligence

Active Testing for LLMs via Approximate Neyman Allocation

ai-technology · 2026-05-12

A new active testing algorithm for generative tasks in large language models (LLMs) has been introduced. The method uses semantic entropy computed with surrogate models to stratify the evaluation pool, then applies approximate Neyman allocation driven by the surrogate signals. The approach reduces evaluation cost by labeling only a small but informative subset of the pool. Experiments across multiple language and multimodal benchmarks show significant gains over existing active testing methods, which primarily target classification tasks and break down on generative ones. The work addresses the growing need for efficient LLM evaluation as models scale and expert annotation costs rise.
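
Classic Neyman allocation assigns a labeling budget across strata in proportion to each stratum's size times its variability, n_h ∝ N_h·σ_h. The sketch below illustrates that allocation rule under stated assumptions: the strata are taken to be semantic-entropy bins, and the per-stratum standard deviations are placeholder surrogate-based estimates, not values from the paper.

```python
import numpy as np

def neyman_allocation(stratum_sizes, stratum_stds, budget):
    """Split `budget` labels across strata proportionally to N_h * sigma_h
    (Neyman allocation), rounding while preserving the total."""
    sizes = np.asarray(stratum_sizes, dtype=float)
    weights = sizes * np.asarray(stratum_stds, dtype=float)
    if weights.sum() == 0:
        # Degenerate case: fall back to proportional allocation.
        weights = sizes
    raw = budget * weights / weights.sum()
    alloc = np.floor(raw).astype(int)
    # Hand out the leftover labels to the largest fractional parts.
    remainder = budget - alloc.sum()
    order = np.argsort(-(raw - alloc))
    alloc[order[:remainder]] += 1
    # Never request more labels than a stratum contains.
    return np.minimum(alloc, sizes.astype(int))

# Hypothetical example: three semantic-entropy bins of an evaluation pool.
sizes = [500, 300, 200]    # items per bin
stds = [0.05, 0.20, 0.40]  # assumed surrogate-based variability estimates
print(neyman_allocation(sizes, stds, budget=100))
```

High-entropy (high-variance) strata receive disproportionately many labels, which is what makes the small labeled subset informative.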

Key facts

  • arXiv:2605.10075v1
  • Active testing aims to approximate full evaluation results using labels from only a small subset of the evaluation pool.
  • Existing active testing approaches primarily target classification and break down on generative tasks.
  • The new algorithm is tailored to generative tasks.
  • It leverages semantic entropy from surrogate models to stratify the evaluation pool.
  • Approximate Neyman allocation is conducted based on signals from surrogates.
  • Tests were performed across multiple language and multimodal benchmarks.
  • The method significantly improves on baseline approaches.

Entities

Institutions

  • arXiv

Sources