Toxic Prompts Reduce LLM Factual Accuracy, Study Finds
A recent study published on arXiv (2605.30913) examines how toxic language in prompts affects the factual accuracy of large language models (LLMs). The researchers evaluated five LLMs on ARC-Easy, GSM8K, and MMLU, using prompt variants categorized as polite, random, and three levels of toxicity. Toxic language consistently reduced factual accuracy and increased uncertainty, whereas polite language produced only limited and inconsistent effects. Activation and attribution-graph analyses indicate that increasing toxicity selectively amplifies perturbation-sensitive variant nodes, while stable core reasoning remains intact. The study highlights the risks of deploying LLMs in adversarial dialogue settings.
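The evaluation design described above (one question, several prompt conditions, accuracy compared per condition) can be illustrated with a minimal Python sketch. Everything specific here is an assumption for illustration: the prefix strings are placeholders rather than the study's actual wording, the unperturbed "baseline" condition is assumed as a reference point, and the names `PREFIXES`, `build_variants`, and `accuracy_by_condition` are hypothetical, as is the exact-match scoring.

```python
from collections import defaultdict

# Hypothetical placeholder prefixes. The study's actual prompt wording is not
# given in this summary, and the "baseline" condition is an assumption.
PREFIXES = {
    "baseline": "",
    "polite": "Could you please help with this question?",
    "random": "The weather was mild on Tuesday.",
    "toxic_1": "<mild toxic prefix>",
    "toxic_2": "<moderate toxic prefix>",
    "toxic_3": "<severe toxic prefix>",
}


def build_variants(question):
    """Return one prompt per condition by prepending that condition's prefix."""
    return {
        condition: f"{prefix} {question}".strip()
        for condition, prefix in PREFIXES.items()
    }


def accuracy_by_condition(records):
    """Compute exact-match accuracy per condition.

    records: iterable of (condition, predicted_answer, gold_answer) tuples.
    Answers are compared after trimming whitespace and lowercasing.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for condition, predicted, gold in records:
        total[condition] += 1
        if predicted.strip().lower() == gold.strip().lower():
            correct[condition] += 1
    return {condition: correct[condition] / total[condition] for condition in total}
```

In such a setup, each variant prompt would be sent to a model and the returned answers collected as `(condition, predicted, gold)` records, so that per-condition accuracies can be compared against the unperturbed prompt.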
Key facts
- Study published on arXiv with ID 2605.30913
- Five LLMs evaluated on ARC-Easy, GSM8K, and MMLU
- Prompt variations included polite, random, and three toxicity levels
- Toxic perturbations consistently reduce factual accuracy and increase uncertainty
- Polite phrasing yields limited and inconsistent changes
- Attribution-graph analyses used to examine internal model changes
- Increasing toxicity amplifies perturbation-sensitive variant nodes
- Stable core reasoning remains intact under toxic prompts
Entities
Institutions
- arXiv