ARTFEED — Contemporary Art Intelligence

Submodular Benchmark Selection for Efficient LLM Evaluation

ai-technology · 2026-05-06

Researchers have formalized the choice of a compact, informative set of benchmarks for evaluating large language models as a submodular maximization problem under a multivariate Gaussian model of benchmark scores. Two natural objectives, the entropy of the selected subset (the log-determinant of its covariance) and the mutual information between the selected and remaining benchmarks, are both submodular. Greedy entropy selection coincides with pivoted Cholesky and inherits its spectral bounds on the residual; mutual information is non-monotone in general but empirically monotone for small subsets, so greedy optimization remains effective. Experiments on three score matrices drawn from ten public leaderboards show that mutual-information selection outperforms entropy selection for imputing the scores of unselected benchmarks at small subset sizes.
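
Under the Gaussian model, the entropy objective's greedy step selects the benchmark with the largest conditional variance given those already chosen, which is precisely the pivoting rule of pivoted Cholesky. Below is a minimal NumPy sketch of that equivalence; the function name and toy data are illustrative, not taken from the paper.

```python
import numpy as np

def greedy_entropy_selection(Sigma, k):
    """Greedily maximize log det(Sigma[S, S]) over subsets S of size k.

    Under the Gaussian model this is the entropy objective. Each greedy
    step adds the benchmark with the largest diagonal entry of the
    current Schur complement, i.e. the largest variance conditioned on
    the benchmarks already selected; this is pivoted Cholesky's rule.
    """
    residual = np.array(Sigma, dtype=float)  # current Schur complement
    selected = []
    for _ in range(k):
        gains = np.diag(residual).copy()
        gains[selected] = -np.inf            # exclude chosen benchmarks
        j = int(np.argmax(gains))            # pivot: largest conditional variance
        selected.append(j)
        col = residual[:, [j]]               # pivot column, shape (n, 1)
        residual = residual - (col @ col.T) / residual[j, j]
    return selected

# Toy usage: covariance of 12 benchmarks estimated from 40 models' scores.
rng = np.random.default_rng(0)
scores = rng.standard_normal((40, 12))
Sigma_hat = np.cov(scores, rowvar=False)
print(greedy_entropy_selection(Sigma_hat, 4))  # indices of the 4 picks
```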

Key facts

  • Formalizes benchmark selection as submodular maximization under a multivariate Gaussian model of benchmark scores.
  • Entropy (log-determinant of the covariance) and mutual information are natural objectives, and both are submodular.
  • Greedy entropy selection coincides with pivoted Cholesky (sketched above).
  • Mutual information is non-monotone in general but empirically monotone for small subsets, so greedy selection remains effective (see the greedy sketch after this list).
  • Mutual information selection outperforms entropy selection for imputation at small subset sizes (see the imputation sketch after this list).
  • Experiments used three score matrices drawn from ten public leaderboards.
  • Motivation: evaluating LLMs across many benchmarks is expensive, and many benchmarks are correlated with one another.
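
As noted above, the mutual-information objective can still be optimized greedily at small subset sizes despite being non-monotone in general. A naive sketch under the same Gaussian assumptions follows; the names and the brute-force evaluation strategy are illustrative, not the paper's implementation.

```python
import numpy as np

def _logdet(M):
    # Log-determinant via Cholesky; assumes M is positive definite.
    return 2.0 * np.log(np.diag(np.linalg.cholesky(M))).sum()

def greedy_mi_selection(Sigma, k):
    """Greedily grow S to maximize I(X_S; X_R) under the Gaussian model,
    where R is the complement of S. Since
        I(X_S; X_R) = 0.5 * (logdet Sigma_SS + logdet Sigma_RR - logdet Sigma),
    the constant logdet Sigma can be dropped from the greedy score.
    Naive O(n^4) per step, which is fine for leaderboard-sized matrices.
    """
    n = Sigma.shape[0]
    selected = []
    for _ in range(k):
        best_j, best_val = -1, -np.inf
        for j in range(n):
            if j in selected:
                continue
            S = selected + [j]
            R = [i for i in range(n) if i not in S]
            val = _logdet(Sigma[np.ix_(S, S)]) + _logdet(Sigma[np.ix_(R, R)])
            if val > best_val:
                best_j, best_val = j, val
        selected.append(best_j)
    return selected
```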

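The imputation task the experiments measure can be read as conditional-mean prediction in the fitted Gaussian: given a new model's scores on the selected benchmarks, predict the rest. A hypothetical helper along those lines (assumed interface, not the paper's code):

```python
import numpy as np

def impute_missing(Sigma, mu, selected, x_sel):
    """Predict a model's unselected benchmark scores from its selected
    ones, using the conditional mean of the fitted Gaussian:
        E[x_R | x_S] = mu_R + Sigma_RS @ inv(Sigma_SS) @ (x_S - mu_S)
    Sigma and mu are the fitted covariance and mean (NumPy arrays),
    selected is a list of benchmark indices, x_sel the observed scores.
    """
    n = Sigma.shape[0]
    R = [i for i in range(n) if i not in selected]
    K_ss = Sigma[np.ix_(selected, selected)]
    K_rs = Sigma[np.ix_(R, selected)]
    return mu[R] + K_rs @ np.linalg.solve(K_ss, x_sel - mu[selected])
```
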
Entities

Institutions

  • arXiv
