LLM Toxicity Benchmarks Show Bias When Task Changes
A new study posted to arXiv (2605.10639) shows that toxicity benchmarks for large language models (LLMs) are not robust. The researchers found that shifting the evaluation task from text completion to summarization significantly increases the likelihood of model output being flagged as harmful. The work also identifies overlooked intrinsic biases tied to model choice, evaluation metrics, and task type. These discrepancies mean that organizations relying solely on current benchmarks to certify systems for customer-facing applications and automated moderation risk deploying unsafe ones.
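The task-sensitivity finding can be made concrete with a small evaluation sketch. The snippet below is a minimal illustration, not the study's actual pipeline: `generate` is a toy stand-in for an LLM call, `toxicity_score` is a toy lexicon-based scorer standing in for a real toxicity classifier, and the lexicon and the 0.5 flag threshold are assumptions made here for illustration.

```python
# Minimal sketch of comparing flag rates under two task framings
# (text completion vs. summarization). `generate`, `toxicity_score`,
# the toy lexicon, and the 0.5 threshold are illustrative assumptions,
# not the study's actual pipeline.

FLAG_THRESHOLD = 0.5  # assumed cutoff above which an output counts as "flagged"

TOXIC_WORDS = {"hate", "stupid", "idiot"}  # toy lexicon, demonstration only


def generate(prompt: str) -> str:
    """Toy stand-in for an LLM call; a real harness would query a model."""
    return prompt  # echo the prompt so the sketch runs end to end


def toxicity_score(text: str) -> float:
    """Toy lexicon scorer; a real harness would use a toxicity classifier."""
    words = text.lower().split()
    return sum(w in TOXIC_WORDS for w in words) / max(len(words), 1)


def completion_prompt(passage: str) -> str:
    # Text-completion framing: the model continues the passage directly.
    return passage


def summarization_prompt(passage: str) -> str:
    # Summarization framing: the same passage wrapped in an instruction.
    return f"Summarize the following text:\n\n{passage}"


def flag_rate(passages: list[str], make_prompt) -> float:
    """Fraction of generations whose toxicity score exceeds the threshold."""
    flagged = sum(
        toxicity_score(generate(make_prompt(p))) > FLAG_THRESHOLD
        for p in passages
    )
    return flagged / len(passages)


if __name__ == "__main__":
    data = ["You are stupid and I hate this.", "The weather is pleasant today."]
    print("completion:   ", flag_rate(data, completion_prompt))
    print("summarization:", flag_rate(data, summarization_prompt))
```

With these toy stand-ins the two rates come out trivially similar; the study's point is that with real models and real toxicity classifiers, the same underlying data yields a noticeably higher flag rate under the summarization framing than under the completion framing.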
Key facts
- Preprint posted to arXiv under ID 2605.10639
- Investigates bias in LLM toxicity benchmarks
- Changing the task from text completion to summarization raises the rate at which outputs are flagged as harmful
- Identifies biases in model choice, metrics, and task types
- Benchmarks show inconsistent behavior across input data domains
- Unrecognized biases risk the deployment of vulnerable or unsafe systems
Entities
Institutions
- arXiv