Evaluation Artifacts Inflate Unsolvability in Multi-LLM Routing
A large-scale study of multi-tier LLM routing across 206,000 query-model pairs finds that reported unsolvability ceilings are largely evaluation artifacts rather than genuine model limits. Using the Gemma 4 and Llama 3.1 model families on six benchmarks (MMLU, MedQA, HumanEval, MBPP, Alpaca, ShareGPT), the authors identify three systematic artifacts: LLM-as-a-judge biases that favor verbosity over correctness, truncation under fixed generation budgets, and output format mismatches. Dual-judge validation and exact-match grounding reduce measured unsolvability, and a decomposition framework attributes failures to these artifacts, revealing consistent patterns across benchmarks.
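To make the mitigation concrete, below is a minimal sketch of dual-judge validation with exact-match grounding. It is a hypothetical rendering, not the authors' implementation: `judge_a`, `judge_b`, and the `normalize` heuristic are assumed names, and the judges are treated as opaque callables returning a boolean verdict.

```python
def normalize(text: str) -> str:
    """Canonicalize an answer for exact-match comparison (assumed heuristic)."""
    return " ".join(text.strip().lower().split())

def exact_match(prediction: str, reference: str) -> bool:
    """Ground the score in the reference answer when one exists."""
    return normalize(prediction) == normalize(reference)

def dual_judge_score(prediction: str, reference: str, judge_a, judge_b) -> bool:
    """Count a response correct only when grounded or unanimously judged.

    1. An exact match against the reference short-circuits the judges,
       removing judge bias on benchmarks with canonical answers.
    2. Otherwise both judges must independently agree the response is
       correct; disagreement counts as incorrect.
    """
    if exact_match(prediction, reference):
        return True
    return judge_a(prediction, reference) and judge_b(prediction, reference)
```

Requiring unanimity biases the score toward precision: a verbose answer that sways only one judge no longer counts as correct, and exact matches bypass the judges entirely.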
Key facts
- Study involves 206,000 query-model pairs across six benchmarks
- Uses Gemma 4 and Llama 3.1 families
- Evaluates with LLM-as-a-judge and exact-match metrics
- Identifies three evaluation artifacts: judge biases, truncation, format mismatches
- Dual-judge validation and exact-match grounding reduce measured unsolvability (sketched above)
- Introduces a decomposition framework for failure attribution (see the sketch after this list)
- Published on arXiv with ID 2605.07395
- Benchmarks include MMLU, MedQA, HumanEval, MBPP, Alpaca, ShareGPT
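The decomposition idea can be illustrated as follows: each failing query-model pair is assigned to an artifact bucket before being counted as unsolvable. The record fields and detection heuristics here (budget exhaustion for truncation, a parse flag for format mismatch, judge disagreement for judge bias) are assumptions for illustration, not the paper's exact procedure.

```python
from dataclasses import dataclass

@dataclass
class EvalRecord:
    prediction: str
    reference: str
    tokens_generated: int
    token_budget: int
    judge_verdicts: tuple[bool, bool]  # (judge A, judge B)
    format_ok: bool                    # did the output parse in the expected format?

def attribute_failure(rec: EvalRecord) -> str:
    """Classify a failed query-model pair into an artifact bucket."""
    if rec.tokens_generated >= rec.token_budget:
        return "truncation"       # fixed generation budget cut the answer off
    if not rec.format_ok:
        return "format_mismatch"  # plausible content, wrong output format
    if rec.judge_verdicts[0] != rec.judge_verdicts[1]:
        return "judge_bias"       # judges disagree, so a single-judge score is unreliable
    return "genuine_failure"      # residual: candidate for true unsolvability
```

Aggregating these labels over the 206,000 query-model pairs is what would yield an artifact-versus-genuine-failure breakdown of the kind the framework reports.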