LLM Toxicity Benchmarks Show Bias When Task Changes
A new study posted to arXiv (2605.10639) shows that toxicity benchmarks for large language models (LLMs) are not robust. The researchers found that shifting the evaluation task from text completion to summarization significantly increases the likelihood of model output being flagged as harmful. The work also identifies overlooked intrinsic biases tied to model choice, evaluation metrics, and task type. These discrepancies mean that organizations relying solely on current benchmarks to certify systems for customer-facing applications and automated moderation risk deploying unsafe ones.
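The task-sensitivity finding can be made concrete with a small evaluation sketch. The snippet below is a minimal illustration, not the study's actual pipeline: `generate` is a toy stand-in for an LLM call, `toxicity_score` is a toy lexicon-based scorer standing in for a real toxicity classifier, and the lexicon and the 0.5 flag threshold are assumptions made here for illustration.

```python
# Minimal sketch of comparing flag rates under two task framings
# (text completion vs. summarization). `generate`, `toxicity_score`,
# the toy lexicon, and the 0.5 threshold are illustrative assumptions,
# not the study's actual pipeline.

FLAG_THRESHOLD = 0.5  # assumed cutoff above which an output counts as "flagged"

TOXIC_WORDS = {"hate", "stupid", "idiot"}  # toy lexicon, demonstration only


def generate(prompt: str) -> str:
    """Toy stand-in for an LLM call; a real harness would query a model."""
    return prompt  # echo the prompt so the sketch runs end to end


def toxicity_score(text: str) -> float:
    """Toy lexicon scorer; a real harness would use a toxicity classifier."""
    words = text.lower().split()
    return sum(w in TOXIC_WORDS for w in words) / max(len(words), 1)


def completion_prompt(passage: str) -> str:
    # Text-completion framing: the model continues the passage directly.
    return passage


def summarization_prompt(passage: str) -> str:
    # Summarization framing: the same passage wrapped in an instruction.
    return f"Summarize the following text:\n\n{passage}"


def flag_rate(passages: list[str], make_prompt) -> float:
    """Fraction of generations whose toxicity score exceeds the threshold."""
    flagged = sum(
        toxicity_score(generate(make_prompt(p))) > FLAG_THRESHOLD
        for p in passages
    )
    return flagged / len(passages)


if __name__ == "__main__":
    data = ["You are stupid and I hate this.", "The weather is pleasant today."]
    print("completion:   ", flag_rate(data, completion_prompt))
    print("summarization:", flag_rate(data, summarization_prompt))
```

With these toy stand-ins the two rates come out trivially similar; the study's point is that with real models and real toxicity classifiers, the same underlying data yields a noticeably higher flag rate under the summarization framing than under the completion framing.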
Key facts
- Preprint posted to arXiv under ID 2605.10639
- Investigates bias in LLM toxicity benchmarks
- Changing the task from text completion to summarization raises the rate at which outputs are flagged as harmful
- Identifies biases in model choice, metrics, and task types
- Benchmarks show inconsistent behavior across input data domains
- Unrecognized biases risk the deployment of vulnerable or unsafe systems
Entities
Institutions
- arXiv