New Methods Localize and Suppress Toxicity in Language Models

ai-technology · 2026-05-28

Scientists have unveiled Meow2X and TRNE, two retraining-free frameworks that localize toxicity to specific layers and neurons of large language models by comparing activations on toxic and neutral prompts. Toxicity is then suppressed through inference-time scaling or small rank-one weight edits, with no gradient descent. Evaluations across five language models, two benchmarks, and 90 configurations, scored by two safety evaluators, show a consistent reduction in toxicity while preserving language modeling quality. The analysis indicates that early MLP layers disproportionately encode toxicity, that this encoding differs among architectures, and that evaluations relying on a single evaluator often underestimate toxicity.
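To make the localization idea concrete, here is a minimal sketch of ranking neurons by the gap between their mean activation on toxic prompts and on neutral prompts. It is not the Meow2X or TRNE procedure, whose details are not given here; the function name, the toy activation values, and the choice of an absolute mean difference as the score are all assumptions made for illustration.

```python
from statistics import mean


def rank_neurons_by_activation_gap(toxic_acts, neutral_acts, top_k=2):
    """Return indices of the top_k neurons with the largest
    |mean toxic activation - mean neutral activation|.

    Each argument is a list of rows, one row per prompt, one value per neuron.
    This scoring rule is an illustrative assumption, not the published method.
    """
    n_neurons = len(toxic_acts[0])
    gaps = []
    for j in range(n_neurons):
        toxic_mean = mean(row[j] for row in toxic_acts)
        neutral_mean = mean(row[j] for row in neutral_acts)
        gaps.append((abs(toxic_mean - neutral_mean), j))
    # Largest gap first; ties broken by neuron index for determinism.
    gaps.sort(key=lambda g: (-g[0], g[1]))
    return [j for _, j in gaps[:top_k]]


# Toy activations (hypothetical): 2 prompts per class, 4 neurons in one layer.
toxic = [[0.1, 2.0, 0.3, 1.5], [0.2, 2.2, 0.1, 1.7]]
neutral = [[0.1, 0.2, 0.3, 1.4], [0.3, 0.4, 0.2, 1.6]]
top = rank_neurons_by_activation_gap(toxic, neutral, top_k=2)
print(top)
```

In a real model the rows would come from hidden activations recorded at a given layer, and the same ranking could be repeated per layer to see where the gap concentrates.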

Key facts

Meow2X and TRNE are retraining-free frameworks
Localize toxicity to specific layers and neurons
Suppress via inference-time scaling or rank-one weight edits
Evaluated on five LMs, two benchmarks, 90 configurations
Dual safety evaluators used
Toxicity concentrated in early MLP layers
Single-evaluator setups underestimate toxicity
No gradient descent required
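The two suppression routes named above can be sketched in a few lines. This is a hedged illustration under stated assumptions, not the frameworks' actual update rules: `scale_neurons` damps chosen activations at inference time, and `rank_one_edit` applies one common rank-one update, W' = W - d (dᵀW) / (dᵀd), which removes a hypothetical "toxic direction" d from a layer's output. The function names, the direction d, and the toy weights are invented for the example.

```python
def scale_neurons(activations, neuron_ids, alpha=0.0):
    """Inference-time scaling: multiply the selected neurons by alpha."""
    ids = set(neuron_ids)
    return [a * alpha if j in ids else a for j, a in enumerate(activations)]


def rank_one_edit(W, d):
    """Return W - d (d^T W) / (d^T d).

    The subtracted term is an outer product, so the change has rank one.
    After the edit, every output W' x has zero component along d.
    Choosing this particular projection is an assumption for illustration.
    """
    norm_sq = sum(v * v for v in d)
    n_rows, n_cols = len(W), len(W[0])
    d_t_w = [sum(d[i] * W[i][c] for i in range(n_rows)) for c in range(n_cols)]
    return [
        [W[i][c] - d[i] * d_t_w[c] / norm_sq for c in range(n_cols)]
        for i in range(n_rows)
    ]


def matvec(W, x):
    """Plain matrix-vector product for the toy example."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


# Toy layer (hypothetical values): 2 outputs, 2 inputs.
W = [[1.0, 2.0], [3.0, 4.0]]
d = [1.0, 0.0]  # assumed toxic direction in output space
W_edited = rank_one_edit(W, d)
y = matvec(W_edited, [1.0, 1.0])
scaled = scale_neurons([1.0, 2.0, 3.0], neuron_ids=[1], alpha=0.5)
```

Neither route computes a gradient: scaling touches only activations during the forward pass, and the rank-one edit is a closed-form change to a single weight matrix.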
