Efficient LAM Evaluation Achieves 0.93 Correlation with Full Benchmarks Using 0.3% Data
Researchers propose evaluating large audio models (LAMs) on minimal subsets of just 50 examples (0.3% of benchmark data), achieving over 0.93 Pearson correlation with full benchmark scores. The study analyzed 10 subset selection methods across 18 audio models and 40 tasks. To test alignment with user satisfaction, the authors collected 776 human preference ratings from realistic voice assistant conversations; both the subsets and the full benchmarks correlated with these preferences at only 0.85. Regression models trained on the selected subsets reached 0.98 correlation with human preferences, outperforming models trained on the full data. The findings suggest efficient evaluation can cut costs while maintaining reliability.
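A minimal sketch of how subset fidelity can be checked, in the spirit of the study: score each model on a small subset and on the full benchmark, then compute the Pearson correlation of the two per-model score vectors. This is illustrative pseudocode with hypothetical data structures, not the paper's actual pipeline.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def subset_fidelity(per_example_scores, subset_idx):
    """Correlate model scores from a small subset with full-benchmark scores.

    per_example_scores: dict mapping model name -> list of per-example scores
        on the full benchmark (hypothetical layout for illustration).
    subset_idx: indices of the selected subset (e.g. 50 of thousands).
    Returns Pearson r across models between subset means and full means.
    """
    models = sorted(per_example_scores)
    full = [sum(per_example_scores[m]) / len(per_example_scores[m]) for m in models]
    sub = [sum(per_example_scores[m][i] for i in subset_idx) / len(subset_idx)
           for m in models]
    return pearson(sub, full)
```

A well-chosen subset is one for which `subset_fidelity` stays high (the paper reports over 0.93 for 50-example subsets) across many models and tasks.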
Key facts
- Subsets of 50 examples (0.3% of data) achieve over 0.93 Pearson correlation with full benchmark scores
- Study analyzed 10 subset selection methods with 18 audio models across 40 tasks
- 776 human preference ratings collected from realistic voice assistant conversations
- Both the 50-example subsets and the full benchmarks achieve only 0.85 correlation with human preferences
- Regression models on selected subsets achieve 0.98 correlation with human preferences
- Study aims to reduce costs and data redundancy in LAM evaluation
- Research published on arXiv (2605.00022)
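The 0.98 figure comes from regression models that map benchmark scores onto human preference ratings. A minimal sketch of the idea, using a closed-form one-feature least-squares fit on hypothetical data (the paper's actual regressors and features are not specified here):

```python
def fit_linear(xs, ys):
    """Ordinary least squares fit y ≈ a*x + b (single feature, closed form)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return a, b

# Hypothetical example: per-model subset benchmark scores and mean human
# preference ratings; fit the regressor, then predict for a new model.
subset_scores = [0.42, 0.55, 0.61, 0.70]
human_prefs = [2.9, 3.4, 3.7, 4.1]
a, b = fit_linear(subset_scores, human_prefs)
predicted_pref = a * 0.65 + b  # estimated rating for an unseen model
```

Calibrating benchmark scores against even a modest pool of human ratings (776 in the study) is what lifts the preference correlation from 0.85 to 0.98.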
Entities
Institutions
- arXiv