Text Degeneration: A Hidden Cost in LLM Inference
A new study reports that text degeneration, a self-reinforcing loop in which an autoregressive language model repeats tokens indefinitely, can inflate inference costs by over 40% even when it affects fewer than 3% of requests. The phenomenon, first formalized by Holtzman et al. in 2020, is structural: it arises from the maximum-likelihood training objective and cannot be fully mitigated by decoding strategies alone. In experiments with the Qwen2.5-VL-7B-Instruct model on OCR tasks, degenerate requests increased total wall-clock time by 42.47% and raised the mean duration of healthy requests running alongside them by up to 71%. As a structural fix, the authors propose Direct Preference Optimization (DPO) with degenerate outputs as the rejected side of each preference pair, which reduced degeneration rates by 37–87% across model families. They argue that degeneration rate should be a first-class benchmark metric, because standard evaluations overlook both this failure mode and its operational impact.
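To see how so few requests can dominate cost, consider a back-of-the-envelope model. The sketch below is not taken from the study. It assumes that decoding time is proportional to the number of generated tokens, that requests run one after another, and that a degenerate request runs until it hits a token cap. The function name and the numbers (500 tokens for a healthy request, an 8,000-token cap) are hypothetical and chosen only for illustration.

```python
def relative_inflation(fraction_degenerate: float,
                       healthy_tokens: float,
                       cap_tokens: float) -> float:
    """Relative increase in total decoding time caused by degenerate requests.

    Assumptions (illustrative, not from the study):
      - time is proportional to generated tokens,
      - a healthy request emits `healthy_tokens` tokens,
      - a degenerate request emits `cap_tokens` tokens (it runs to the cap).
    """
    if not 0.0 <= fraction_degenerate <= 1.0:
        raise ValueError("fraction_degenerate must be in [0, 1]")
    if healthy_tokens <= 0:
        raise ValueError("healthy_tokens must be positive")
    # Baseline cost per request: healthy_tokens.
    # Cost with degeneration: (1 - f) * healthy_tokens + f * cap_tokens.
    # Their ratio minus 1 simplifies to f * (cap - healthy) / healthy.
    return fraction_degenerate * (cap_tokens - healthy_tokens) / healthy_tokens


# Hypothetical workload: 3% of requests degenerate and run to the cap.
print(f"{relative_inflation(0.03, 500, 8000):.1%}")
```

Under these assumptions, 3% of requests running to the cap add roughly 45% to total decoding time, the same order of magnitude as the reported figure. The model ignores the slowdown of healthy requests that the study observed when degenerate requests ran in parallel.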
Key facts
- Text degeneration is a self-reinforcing failure mode of autoregressive language models.
- Fewer than 3% of requests can consume nearly half of total wall-clock time.
- In the Qwen2.5-VL-7B-Instruct OCR experiment, degenerate requests inflated total wall-clock time by 42.47%.
- Healthy request durations rose by 15–71% when degenerate requests ran in parallel.
- The phenomenon was first formalized by Holtzman et al. in 2020.
- DPO with degenerate-rejected pairs reduced degeneration by 37–87% across model families.
- The smallest specialized model (3B) achieved the lowest degeneration rate (0.20%).
- Standard benchmarks do not track degeneration rate as a metric.
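Tracking degeneration rate as a metric requires some way to flag degenerate outputs. The sketch below uses a simple heuristic, which is an assumption here and not the study's detection criterion: an output counts as degenerate if it ends with a short block of tokens repeated several times back to back. The function names, the thresholds (`max_period`, `min_repeats`), and the whitespace tokenization are all hypothetical.

```python
from typing import Sequence


def ends_in_loop(tokens: Sequence[str],
                 max_period: int = 8,
                 min_repeats: int = 4) -> bool:
    """Heuristic: True if the output ends with a block of 1..max_period
    tokens repeated at least `min_repeats` times consecutively."""
    n = len(tokens)
    for period in range(1, max_period + 1):
        span = period * min_repeats
        if span > n:
            break  # longer periods need even more tokens
        tail = tokens[n - span:]
        block = tail[:period]
        # The tail is a loop if every token matches the block cyclically.
        if all(tail[i] == block[i % period] for i in range(span)):
            return True
    return False


def degeneration_rate(outputs: Sequence[Sequence[str]]) -> float:
    """Fraction of outputs flagged as degenerate by `ends_in_loop`."""
    if not outputs:
        return 0.0
    flagged = sum(ends_in_loop(tokens) for tokens in outputs)
    return flagged / len(outputs)


# Toy outputs, split on whitespace purely for illustration.
outputs = [
    "the cat sat on the mat".split(),
    "total : 12 12 12 12 12".split(),
    "invoice number 4471 dated today".split(),
    "name : a b a b a b a b".split(),
]
print(degeneration_rate(outputs))
```

A tail-only check like this misses loops that occur mid-output and can misfire on legitimately repetitive content such as table rows, so any threshold would need validating against labeled examples before the rate is reported alongside accuracy.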
Entities
Institutions
- HuggingFace
- DharmaOCR
- Qwen
- Nanonets
- arXiv