New Metric Measures Lexical Diversity Loss in LLM Decoding

ai-technology · 2026-05-27

Researchers have introduced the Word Coverage Score (WCS), a metric that quantifies how standard sampling filters such as Top-p, Top-k, and Min-p suppress low-frequency, high-information words in the output of large language models (LLMs). The study, published on arXiv (2605.27268), audits open-weight models on human-authored corpus fragments to measure lexical survival rates. The findings provide quantitative evidence that decoding mechanics, rather than model knowledge alone, contribute to repetitive and homogeneous text generation. WCS assesses which contextually appropriate human words become unreachable because a filter prunes them, even though the model still assigns them probability.
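To make the pruning effect concrete, here is a minimal Python sketch of the three filters applied to a toy next-token distribution. The vocabulary, probabilities, thresholds, and function names are all illustrative assumptions, not taken from the paper, and the sketch does not reproduce the paper's WCS formula. It only shows how a word with nonzero probability can fall outside the set a filter allows the sampler to draw from.

```python
def top_k_keep(probs, k):
    """Indices of the k most probable tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return set(order[:k])


def top_p_keep(probs, p):
    """Smallest set of most-probable tokens whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, total = set(), 0.0
    for i in order:
        kept.add(i)
        total += probs[i]
        if total >= p:
            break
    return kept


def min_p_keep(probs, min_p):
    """Tokens whose probability is at least min_p times the top probability."""
    threshold = min_p * max(probs)
    return {i for i, q in enumerate(probs) if q >= threshold}


# Toy next-token distribution (made-up numbers, for illustration only).
vocab = ["the", "a", "this", "luminous", "gossamer"]
probs = [0.50, 0.30, 0.15, 0.04, 0.01]

# "luminous" has probability 0.04, so the model has not ruled it out.
rare = vocab.index("luminous")

# Check whether the rare word is still reachable under each filter.
survives = {
    "top_k=3": rare in top_k_keep(probs, 3),
    "top_p=0.9": rare in top_p_keep(probs, 0.9),
    "min_p=0.1": rare in min_p_keep(probs, 0.1),
}

for name, ok in survives.items():
    print(f"{name}: {'reachable' if ok else 'pruned'}")
```

With these hypothetical settings, all three filters prune "luminous" even though the model gives it a 4% chance. Loosening a filter (for example, raising k to 4) brings the word back into the sampling set, which is the kind of survival-versus-setting trade-off a coverage metric would need to track.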

Key facts

Word Coverage Score (WCS) introduced as a metric
Measures lexical survival rate of low-frequency words
Audits open-weight models on human-authored corpus
Focuses on decoding mechanics (Top-p, Top-k, Min-p)
Published on arXiv with ID 2605.27268
Addresses criticism that LLM text is repetitive
Quantifies suppression of linguistic diversity
Shows words can be unreachable despite being in the probability space
