ARTFEED — Contemporary Art Intelligence

Median Cross-Entropy Outperforms Mean for Language Model Validation

publication · 2026-05-26

A recent arXiv preprint (2605.24667) argues that mean cross-entropy (CE), the standard validation metric for language models, can fail to track model quality during training. The authors present two cases in which median CE aligns better with task outcomes. In supervised fine-tuning of Qwen2.5-1.5B on synthetic fact-learning, mean CE rises substantially after the initial learning stage while fact-recall accuracy stays high. In top-K distillation on TinyStories, reducing K improves median CE but worsens mean CE; the Top-5 student has the worst mean CE, yet it earns the best LLM-judge score and overtakes its teacher on median CE. The authors' analysis attributes this to training reshaping the empirical per-token CE distribution: with smaller K, the distribution shifts so that the median falls while the mean rises. These results suggest that median CE may be a more dependable measure of language model quality in such settings.
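To make the mean-versus-median distinction concrete, here is a minimal sketch (not the preprint's code) that computes per-token CE from logits and summarizes it both ways. The function names `per_token_ce` and `summarize_ce` are hypothetical, and the toy CE values are made up purely for illustration; they are not results from the paper.

```python
import numpy as np

def per_token_ce(logits, targets):
    """Per-token cross-entropy in nats for logits of shape (N, V) and integer targets of shape (N,)."""
    logits = np.asarray(logits, dtype=np.float64)
    targets = np.asarray(targets)
    # Subtract the row maximum for a numerically stable log-softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets]

def summarize_ce(ce):
    """Return the mean and median of a vector of per-token CE values."""
    ce = np.asarray(ce, dtype=np.float64)
    return {"mean": float(ce.mean()), "median": float(np.median(ce))}

# Toy illustration with invented numbers (assumption, not data from the preprint):
# a "baseline" model with uniform CE, and a "sharpened" model that is confident
# on most tokens but very wrong on a few.
baseline = np.full(100, 1.0)
sharpened = np.concatenate([np.full(95, 0.2), np.full(5, 20.0)])

print(summarize_ce(baseline))
print(summarize_ce(sharpened))
```

In this toy case the sharpened model has the lower median but the higher mean, because a handful of very costly tokens dominate the average. That is the kind of divergence between the two statistics the summary describes.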

Key facts

  • Mean cross-entropy is the standard validation metric for language models.
  • Mean CE can fail to track model quality during training.
  • Two scenarios examined: Qwen2.5-1.5B SFT on synthetic fact-learning and top-K distillation on TinyStories.
  • In Qwen2.5-1.5B SFT, mean CE rises after initial learning while fact-recall accuracy stays high.
  • In top-K distillation on TinyStories, decreasing K improves median CE but worsens mean CE.
  • The Top-5 student has the highest LLM-judge score and beats the teacher on median CE despite having the worst mean CE.
  • Median CE correlates more closely with task performance than mean CE in both cases.
  • Training reshapes the empirical per-token CE distribution.
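The summary does not spell out how the top-K distillation targets are built, so the following is only a sketch under an assumed formulation: keep the teacher's K most probable tokens and renormalize. The helpers `top_k_targets` and `ce_on_true_token`, the `floor` constant, and the four-token distribution are all hypothetical. The sketch shows one plausible way truncation could pull CE down on typical tokens while pushing it up sharply on rare ones.

```python
import numpy as np

def top_k_targets(teacher_probs, k):
    """Keep the k largest teacher probabilities in each row and renormalize (assumed formulation)."""
    p = np.asarray(teacher_probs, dtype=np.float64)
    # Column indices of the k largest entries per row (order within the k is arbitrary).
    idx = np.argpartition(p, -k, axis=1)[:, -k:]
    truncated = np.zeros_like(p)
    np.put_along_axis(truncated, idx, np.take_along_axis(p, idx, axis=1), axis=1)
    return truncated / truncated.sum(axis=1, keepdims=True)

def ce_on_true_token(probs, true_ids, floor=1e-6):
    """CE in nats of the true token; `floor` is a hypothetical constant that avoids log(0)."""
    p = np.asarray(probs, dtype=np.float64)[np.arange(len(true_ids)), true_ids]
    return -np.log(np.maximum(p, floor))

# Invented four-token teacher distribution (assumption, not data from the preprint).
teacher = np.array([[0.5, 0.3, 0.15, 0.05]])
targets = top_k_targets(teacher, k=2)

# A student that matched the truncated targets exactly would score as follows.
common = np.array([0])  # true token inside the top-2
rare = np.array([3])    # true token outside the top-2
print(ce_on_true_token(teacher, common), ce_on_true_token(targets, common))
print(ce_on_true_token(teacher, rare), ce_on_true_token(targets, rare))
```

Under these assumptions, tokens inside the top-K get a lower CE than under the full teacher distribution, while tokens outside it get a very large CE. If most tokens fall in the first group, the median drops while the mean rises, consistent with the pattern reported for smaller K.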

Entities

Institutions

  • arXiv
