Toxic Prompts Reduce LLM Factual Accuracy, Study Finds

ai-technology · 2026-06-01

A recent study published on arXiv (2605.30913) examines how toxic language in prompts affects the factual accuracy of large language models (LLMs). The researchers evaluated five LLMs on ARC-Easy, GSM8K, and MMLU, using prompt variants categorized as polite, random, and three levels of toxicity. Toxic language consistently reduced factual accuracy and increased model uncertainty, whereas polite phrasing produced only limited and inconsistent changes. Attribution-graph analyses of model activations and influences indicate that increasing toxicity selectively amplifies perturbation-sensitive variant nodes while the stable core reasoning remains intact. The findings point to the risks of deploying LLMs in adversarial dialogue settings.

Key facts

Study published on arXiv with ID 2605.30913
Five LLMs evaluated on ARC-Easy, GSM8K, and MMLU
Prompt variations included polite, random, and three toxicity levels
Toxic perturbations consistently reduce factual accuracy
Polite phrasing yields limited and inconsistent changes
Attribution-graph analyses used to examine internal model changes
Increasing toxicity amplifies perturbation-sensitive variant nodes
Stable core reasoning remains intact under toxic prompts
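
The evaluation setup described above (the same benchmark questions asked under polite, random, and three toxic prompt variants) can be illustrated with a minimal Python sketch. It is not the study's code. The names `VARIANT_PREFIXES` and `accuracy_by_variant`, the `baseline` variant, the placeholder prefix strings, and the exact-match scoring rule are all assumptions made for illustration; this summary does not give the study's actual prompt wording or metric.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# Hypothetical placeholder prefixes. The study's real prompt wording is not
# given in this summary, and the unmodified "baseline" variant is an assumption.
VARIANT_PREFIXES: Dict[str, str] = {
    "baseline": "",
    "polite": "<polite prefix> ",
    "random": "<random prefix> ",
    "toxic_1": "<mild toxic prefix> ",
    "toxic_2": "<moderate toxic prefix> ",
    "toxic_3": "<severe toxic prefix> ",
}


def accuracy_by_variant(
    items: List[Tuple[str, str]],
    answer_fn: Callable[[str], str],
) -> Dict[str, float]:
    """Return exact-match accuracy for each prompt variant.

    items: (question, gold_answer) pairs from a benchmark.
    answer_fn: any callable mapping a prompt string to the model's answer.
    """
    if not items:
        raise ValueError("items must contain at least one question")

    correct: Dict[str, int] = defaultdict(int)
    for question, gold in items:
        for name, prefix in VARIANT_PREFIXES.items():
            prediction = answer_fn(prefix + question)
            # Case-insensitive exact match (an assumed scoring rule).
            if prediction.strip().lower() == gold.strip().lower():
                correct[name] += 1

    return {name: correct[name] / len(items) for name in VARIANT_PREFIXES}
```

Comparing the per-variant accuracies against the baseline would then show whether a given perturbation moves a model's score, which is the kind of comparison the study reports.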
