Median Cross-Entropy Outperforms Mean for Language Model Validation

publication · 2026-05-26

A recent arXiv preprint (2605.24667) argues that mean cross-entropy (CE), the standard validation metric for language models, can fail to track model quality during training. The authors present two settings in which median CE follows task performance more closely. In supervised fine-tuning (SFT) of Qwen2.5-1.5B on synthetic fact-learning, mean CE rises substantially after the initial learning phase even though fact-recall accuracy stays high. In top-K distillation on TinyStories, decreasing K improves median CE but worsens mean CE; the Top-5 student has the worst mean CE, yet it earns the highest LLM-judge score and crosses its teacher on median CE. The authors' analysis attributes this to training reshaping the empirical per-token CE distribution: a smaller K in top-K distillation yields a distribution with a lower median but a higher mean. The results suggest that median CE may be a more reliable signal of language model quality than mean CE in such settings.

Key facts

Mean cross-entropy is the standard validation metric for language models.
Mean CE can fail to track model quality during training.
Two scenarios examined: Qwen2.5-1.5B SFT on synthetic fact-learning and top-K distillation on TinyStories.
In Qwen2.5-1.5B SFT, mean CE rises after initial learning while fact-recall accuracy stays high.
In top-K distillation on TinyStories, decreasing K improves median CE but worsens mean CE.
The Top-5 student has the highest LLM-judge score and crosses the teacher on median CE despite having the worst (highest) mean CE.
Median CE correlates more closely with task performance than mean CE in both cases.
Training reshapes the empirical per-token CE distribution.
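
The divergence between the two statistics can be sketched with a toy example. The snippet below is a minimal illustration, not the preprint's method or data: the per-token probabilities are invented, CE is measured in nats, and the helper names `token_cross_entropies` and `summarize` are hypothetical. It shows how a handful of very confident misses can raise mean CE while median CE falls.

```python
import math
import statistics


def token_cross_entropies(probs_of_correct_token):
    """Per-token CE in nats: -log of the probability assigned to the correct token."""
    return [-math.log(p) for p in probs_of_correct_token]


def summarize(ce_values):
    """Return the mean and median of a list of per-token CE values."""
    return {
        "mean": statistics.fmean(ce_values),
        "median": statistics.median(ce_values),
    }


# Invented probabilities assigned to the correct token (assumption, not paper data).
# Model A is uniformly unsure; model B is sharp on most tokens but badly wrong on one.
model_a = [0.5] * 10
model_b = [0.9] * 9 + [1e-6]

summary_a = summarize(token_cross_entropies(model_a))
summary_b = summarize(token_cross_entropies(model_b))

print("model A:", summary_a)
print("model B:", summary_b)
```

In this toy case model B has the higher mean CE but the lower median CE, because a single heavy-tail token dominates the mean. In a real validation loop the per-token losses would come from the model's unreduced loss output rather than from hand-picked probabilities.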
