Median Cross-Entropy Outperforms Mean for Language Model Validation
A recent preprint on arXiv (2605.24667) indicates that mean cross-entropy (CE), the metric commonly used to validate language models, may not accurately reflect a model's quality during training. The researchers highlight two cases where median CE aligns better with task outcomes. In supervised fine-tuning of Qwen2.5-1.5B on synthetic fact-learning, mean CE rises significantly after the initial learning stage while fact-recall accuracy stays high. In top-K distillation on TinyStories, reducing K improves median CE but worsens mean CE; the Top-5 student, despite having the worst (highest) mean CE, achieves the best LLM-judge score and surpasses its teacher on median CE. Their analysis shows that training reshapes the empirical per-token CE distribution, with smaller K in top-K distillation producing a distribution whose median is lower while its mean is higher. These results suggest that median CE may be a more dependable measure than mean CE for assessing language model quality.
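To make the mean/median divergence concrete, the following is a minimal sketch, not the preprint's code or data. The helper names `per_token_ce` and `ce_summary` and the two arrays of CE values are hypothetical, invented only to show how a heavy tail of a few very costly tokens can raise the mean while the median falls.

```python
import numpy as np


def per_token_ce(logits, targets):
    """Per-token cross-entropy in nats from raw logits.

    logits: array of shape (num_tokens, vocab_size)
    targets: integer array of shape (num_tokens,)
    """
    logits = np.asarray(logits, dtype=float)
    targets = np.asarray(targets)
    # Numerically stable log-softmax: subtract the row-wise maximum first.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Negative log-probability assigned to each target token.
    return -log_probs[np.arange(len(targets)), targets]


def ce_summary(ce_values):
    """Return (mean, median) of a 1-D array of per-token CE values."""
    ce = np.asarray(ce_values, dtype=float)
    return float(ce.mean()), float(np.median(ce))


# Hypothetical per-token CE values, invented for illustration only.
baseline = np.array([0.9, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8])
# Most tokens become cheaper, but two tokens become very costly.
heavy_tail = np.array([0.2, 0.3, 0.3, 0.4, 0.4, 0.5, 0.5, 0.6, 9.0, 10.0])

base_mean, base_median = ce_summary(baseline)
tail_mean, tail_median = ce_summary(heavy_tail)

print(f"baseline:   mean={base_mean:.2f}  median={base_median:.2f}")
print(f"heavy tail: mean={tail_mean:.2f}  median={tail_median:.2f}")
```

In this toy setup the second distribution has the lower median but the higher mean, which is the qualitative pattern the summary describes for smaller K. In practice the per-token values would come from a model's logits on a validation set (for example via `per_token_ce`) rather than from hand-written arrays.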
Key facts
- Mean cross-entropy is the standard validation metric for language models.
- Mean CE can fail to track model quality during training.
- Two scenarios examined: Qwen2.5-1.5B SFT on synthetic fact-learning and top-K distillation on TinyStories.
- In Qwen2.5-1.5B SFT, mean CE rises after initial learning while fact-recall accuracy stays high.
- In top-K distillation on TinyStories, decreasing K improves median CE but worsens mean CE.
- The Top-5 student has the highest LLM-judge score and surpasses its teacher on median CE despite having the worst mean CE.
- Median CE correlates more closely with task performance than mean CE in both cases.
- Training reshapes the empirical per-token CE distribution.
Entities
Institutions
- arXiv