Text Degeneration: A Hidden Cost in LLM Inference

other · 2026-05-22

A new study reports that text degeneration, a self-reinforcing loop in which an autoregressive language model repeats tokens indefinitely, can inflate inference costs by more than 40% even when it affects fewer than 3% of requests. The phenomenon, first formalized by Holtzman et al. in 2020, is structural: it arises from the maximum-likelihood training objective, so decoding strategies alone cannot fully eliminate it. In experiments with the Qwen2.5-VL-7B-Instruct model on OCR tasks, degenerate requests increased total wall-clock time by 42.47% and raised the mean duration of healthy requests running in parallel by up to 71%. As a structural fix, the authors propose Direct Preference Optimization (DPO) with degenerate-rejected pairs, that is, preference pairs in which the degenerate output is the rejected response. This reduced degeneration rates by 37–87% across model families. They argue that degeneration rate should be a first-class benchmark metric, because standard evaluations overlook both this failure mode and its operational impact.
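To see how a small share of requests can dominate cost, the arithmetic can be sketched with a toy model. This is an illustration only, not the study's methodology: it assumes wall-clock time is proportional to generated tokens, that requests run one after another, and that a degenerate request runs until it reaches the token limit. The function name and all numbers below are hypothetical.

```python
def wall_clock_inflation(n_healthy: int, n_degenerate: int,
                         healthy_tokens: int, max_tokens: int) -> float:
    """Relative increase in total generated tokens caused by degenerate requests.

    Toy assumption: time is proportional to tokens generated, and a
    degenerate request always runs to the max_tokens limit.
    """
    total_requests = n_healthy + n_degenerate
    # Baseline: every request finishes after a normal-length output.
    baseline = total_requests * healthy_tokens
    # Actual: degenerate requests keep generating until the limit.
    actual = n_healthy * healthy_tokens + n_degenerate * max_tokens
    return actual / baseline - 1.0


# Illustrative numbers (not from the study): 3% of 1,000 requests degenerate,
# healthy outputs take 200 tokens, the generation limit is 3,000 tokens.
inflation = wall_clock_inflation(n_healthy=970, n_degenerate=30,
                                 healthy_tokens=200, max_tokens=3000)
print(f"{inflation:.2%}")
```

Under these assumptions the inflation is simply the degenerate fraction multiplied by how many times longer a runaway output is than a healthy one, minus the share it would have cost anyway. The slowdown of healthy requests that share a batch with degenerate ones is a separate effect that this sketch does not model.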

Key facts

Text degeneration is a self-reinforcing failure mode of autoregressive language models.
Fewer than 3% of requests can consume nearly half of total wall-clock time.
Degenerate requests inflated total inference time by 42.47% in one experiment.
Healthy request durations rose by 15–71% when degenerate requests ran in parallel.
The phenomenon was first formalized by Holtzman et al. in 2020.
DPO with degenerate-rejected pairs reduced degeneration by 37–87% across model families.
The smallest specialized model (3B) achieved the lowest degeneration rate (0.20%).
Standard benchmarks do not track degeneration rate as a metric.
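Tracking degeneration rate as a metric requires some way to flag a degenerate output. A minimal sketch follows, assuming a simple heuristic: an output counts as degenerate if it ends in a short block of tokens repeated several times in a row. The function names, the `max_period` and `min_repeats` thresholds, and the heuristic itself are assumptions for illustration and are not the detection method used in the study.

```python
from typing import Sequence


def has_repetition_loop(tokens: Sequence[str],
                        max_period: int = 8,
                        min_repeats: int = 4) -> bool:
    """Return True if the sequence ends in a block of at most max_period
    tokens that is repeated at least min_repeats times back to back."""
    n = len(tokens)
    for period in range(1, max_period + 1):
        span = period * min_repeats
        if span > n:
            break  # longer periods need even more tokens
        tail = tokens[n - span:]
        unit = tail[:period]
        # The tail is a loop if every position matches the repeating unit.
        if all(tail[i] == unit[i % period] for i in range(span)):
            return True
    return False


def degeneration_rate(outputs: Sequence[Sequence[str]]) -> float:
    """Fraction of outputs flagged as ending in a repetition loop."""
    if not outputs:
        return 0.0
    flagged = sum(has_repetition_loop(tokens) for tokens in outputs)
    return flagged / len(outputs)


# Hypothetical example: one healthy output and one that loops on "a b".
healthy = "the invoice total is 42 dollars".split()
looping = "total : a b a b a b a b".split()
print(degeneration_rate([healthy, looping]))
```

A tail-only check is a deliberate simplification: it catches outputs that are still looping when generation stops, which are the ones that run to the token limit, but it misses loops the model later escapes from. Legitimately repetitive content, such as table rows in OCR output, may need higher thresholds.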

Entities

Institutions

Hugging Face
DharmaOCR
Qwen
Nanonets
arXiv

Sources

Hugging Face Blog — 2026-05-22