New Methods Localize and Suppress Toxicity in Language Models
Researchers have introduced Meow2X and TRNE, two frameworks that localize toxicity to specific layers and neurons of large language models by comparing activations on toxic and neutral prompts. Toxicity is then suppressed through inference-time scaling or small rank-one weight edits, with no gradient descent. Evaluations across five language models, two benchmarks, and 90 configurations, scored by two safety evaluators, show a consistent reduction in toxicity while preserving language-modeling quality. The analysis indicates that toxicity is disproportionately encoded in early MLP layers, that its location varies across architectures, and that single-evaluator setups often underestimate it.
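The localize-then-scale idea can be illustrated with a toy sketch. This is not the Meow2X or TRNE implementation: the function names, the mean-difference score, the top-k selection, and the numbers are all assumptions chosen for illustration, and real activations would be captured from a model's MLP layers rather than typed in by hand.

```python
from statistics import mean

def neuron_scores(toxic_acts, neutral_acts):
    """Per-neuron gap between mean activation on toxic and on neutral prompts.

    Each argument is a list of activation vectors, one per prompt.
    (Hypothetical scoring rule, assumed for illustration.)
    """
    width = len(toxic_acts[0])
    return [
        mean(row[j] for row in toxic_acts) - mean(row[j] for row in neutral_acts)
        for j in range(width)
    ]

def top_k_neurons(scores, k):
    """Indices of the k neurons with the largest toxic-minus-neutral gap."""
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]

def scale_neurons(activation, neurons, alpha):
    """Inference-time scaling: multiply the flagged neurons by alpha, keep the rest."""
    targets = set(neurons)
    return [a * alpha if j in targets else a for j, a in enumerate(activation)]

# Toy activations for one layer with four neurons (made-up values).
toxic_acts = [[0.1, 2.0, 0.3, 1.5], [0.2, 1.8, 0.1, 1.7]]
neutral_acts = [[0.1, 0.2, 0.3, 1.4], [0.3, 0.4, 0.2, 1.6]]

scores = neuron_scores(toxic_acts, neutral_acts)
flagged = top_k_neurons(scores, k=1)
damped = scale_neurons([0.5, 2.0, 0.4, 1.0], flagged, alpha=0.5)
```

In this toy data only the second neuron responds much more strongly to toxic prompts, so it is the one flagged and halved. No weights change and no gradients are computed, which is what makes this kind of intervention retraining-free.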
Key facts
- Meow2X and TRNE are retraining-free frameworks
- Localize toxicity to specific layers and neurons
- Suppress via inference-time scaling or rank-one weight edits
- Evaluated on five LMs, two benchmarks, 90 configurations
- Dual safety evaluators used
- Toxicity concentrated in early MLP layers
- Single-evaluator setups underestimate toxicity
- No gradient descent required
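A rank-one weight edit of the general kind mentioned above can be sketched as removing a single direction from a layer's output. The summary does not describe the update rule the authors use, so the projection below, the function name `rank_one_suppress`, and the direction `u` are assumptions made only to show what a gradient-free, rank-one change to a weight matrix looks like.

```python
import math

def rank_one_suppress(W, u, alpha=1.0):
    """Return W - alpha * u (u^T W), with u normalized to unit length.

    W is a list of rows (output_dim x input_dim); u is a direction in
    output space, assumed here to be a hypothetical "toxicity direction".
    The subtracted term is an outer product, so the edit has rank one.
    """
    norm = math.sqrt(sum(x * x for x in u))
    u = [x / norm for x in u]
    rows, cols = len(W), len(W[0])
    # v = u^T W: how strongly each input dimension writes along u.
    v = [sum(u[i] * W[i][j] for i in range(rows)) for j in range(cols)]
    return [[W[i][j] - alpha * u[i] * v[j] for j in range(cols)] for i in range(rows)]

# Toy 2x2 weight matrix and a direction along the second output unit.
W = [[1.0, 2.0], [3.0, 4.0]]
W_edited = rank_one_suppress(W, u=[0.0, 2.0], alpha=1.0)
```

With `alpha=1.0` the edited matrix can no longer write anything along `u`; smaller values of `alpha` would only dampen that direction. The edit is a closed-form update applied once to the stored weights, so it needs no training loop.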
Entities
Institutions
- arXiv