Efficient LAM Evaluation Achieves 0.93 Correlation with Full Benchmarks Using 0.3% Data
Researchers propose evaluating large audio models (LAMs) on minimal subsets of just 50 examples (0.3% of benchmark data), achieving over 0.93 Pearson correlation with full benchmark scores. The study analyzed 10 subset selection methods across 18 audio models and 40 tasks. To test alignment with user satisfaction, the authors collected 776 human preference ratings from realistic voice assistant conversations; both the subsets and the full benchmarks correlated with these preferences at only 0.85. Regression models trained on the selected subsets reached 0.98 correlation with human preferences, outperforming models trained on the full data. The findings suggest efficient evaluation can cut costs while maintaining reliability.
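A minimal sketch of how subset fidelity can be checked, in the spirit of the study: score each model on a small subset and on the full benchmark, then compute the Pearson correlation of the two per-model score vectors. This is illustrative pseudocode with hypothetical data structures, not the paper's actual pipeline.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def subset_fidelity(per_example_scores, subset_idx):
    """Correlate model scores from a small subset with full-benchmark scores.

    per_example_scores: dict mapping model name -> list of per-example scores
        on the full benchmark (hypothetical layout for illustration).
    subset_idx: indices of the selected subset (e.g. 50 of thousands).
    Returns Pearson r across models between subset means and full means.
    """
    models = sorted(per_example_scores)
    full = [sum(per_example_scores[m]) / len(per_example_scores[m]) for m in models]
    sub = [sum(per_example_scores[m][i] for i in subset_idx) / len(subset_idx)
           for m in models]
    return pearson(sub, full)
```

A well-chosen subset is one for which `subset_fidelity` stays high (the paper reports over 0.93 for 50-example subsets) across many models and tasks.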
Key facts
- Subsets of 50 examples (0.3% of data) achieve over 0.93 Pearson correlation with full benchmark scores
- Study analyzed 10 subset selection methods with 18 audio models across 40 tasks
- 776 human preference ratings collected from realistic voice assistant conversations
- Both the 50-example subsets and the full benchmarks achieve only 0.85 correlation with human preferences
- Regression models on selected subsets achieve 0.98 correlation with human preferences
- Study aims to reduce costs and data redundancy in LAM evaluation
- Research published on arXiv (2605.00022)
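The 0.98 figure comes from regression models that map benchmark scores onto human preference ratings. A minimal sketch of the idea, using a closed-form one-feature least-squares fit on hypothetical data (the paper's actual regressors and features are not specified here):

```python
def fit_linear(xs, ys):
    """Ordinary least squares fit y ≈ a*x + b (single feature, closed form)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return a, b

# Hypothetical example: per-model subset benchmark scores and mean human
# preference ratings; fit the regressor, then predict for a new model.
subset_scores = [0.42, 0.55, 0.61, 0.70]
human_prefs = [2.9, 3.4, 3.7, 4.1]
a, b = fit_linear(subset_scores, human_prefs)
predicted_pref = a * 0.65 + b  # estimated rating for an unseen model
```

Calibrating benchmark scores against even a modest pool of human ratings (776 in the study) is what lifts the preference correlation from 0.85 to 0.98.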
Entities
Institutions
- arXiv