ARTFEED — Contemporary Art Intelligence

Causal Intervention Reduces Multiple Biases in Reward Models

ai-technology · 2026-05-01

Researchers propose a causally motivated, inference-time intervention to debias reward models (RMs) used to align large language models (LLMs) with human preferences. The method identifies neurons whose activations correlate with predefined bias attributes, such as response length, and suppresses those signals through neuron-level intervention. Evaluated on RM benchmarks, the approach reduces sensitivity to multiple spurious features without performance trade-offs. Applied to small RMs (2B and 7B parameters), the method edits less than 2% of neurons and improves preference annotation. The work is detailed in arXiv preprint 2604.27495.
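The suppression step described above can be illustrated with a short sketch. The following is a minimal, hypothetical PyTorch rendering, not the preprint's actual implementation: the hook target (`layer`), the zeroing rule, and all names are assumptions for illustration.

```python
# Minimal sketch of inference-time neuron suppression, assuming the
# bias-correlated neuron indices have already been identified. A PyTorch
# forward hook is used here for illustration; the preprint's exact
# mechanism may differ.
import torch

def suppress_neurons(layer: torch.nn.Module, neuron_idx: list[int]):
    """Zero the flagged hidden units of `layer` on every forward pass.

    `layer` is assumed to emit a plain tensor of shape
    (batch, seq_len, hidden_dim), e.g. an MLP sub-module of the reward
    model. Returns a handle; call `handle.remove()` to restore the
    original behavior.
    """
    idx = torch.tensor(neuron_idx, dtype=torch.long)

    def hook(module, inputs, output):
        # Returning a modified tensor from a forward hook replaces the
        # module's output, so the flagged neurons never reach the
        # reward head.
        output = output.clone()
        output[..., idx] = 0.0
        return output

    return layer.register_forward_hook(hook)
```

Because the hook is removable, the same RM can score responses with and without the intervention, which is how sensitivity to a spurious feature such as response length would typically be measured.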

Key facts

  • Reward models are sensitive to spurious features like response length.
  • Existing inference-time debiasing methods target only response length and introduce performance trade-offs.
  • The proposed method uses causally motivated neuron-level intervention.
  • Neurons strongly correlated with bias attributes are identified and suppressed (see the selection sketch after this list).
  • Evaluation shows reduced sensitivity to diverse bias types without performance trade-offs.
  • The method edits less than 2% of neurons in small RMs (2B and 7B parameters).
  • The method improves preference annotation for LLMs.
  • The research is available as arXiv preprint 2604.27495.
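The identification step referenced in the list above can be sketched as a simple correlation screen. In this hypothetical NumPy version, per-example pooled activations are correlated with a bias attribute (e.g., response length) and the most strongly correlated ~2% of neurons are flagged; the pooling, the Pearson statistic, and the thresholding are assumptions of the sketch, with only the bias-attribute framing and the 2% budget taken from the article.

```python
# Hedged sketch: flag the small fraction of neurons (here the top 2%)
# whose activations correlate most strongly with a bias attribute such
# as response length.
import numpy as np

def flag_bias_neurons(activations, bias_values, fraction=0.02):
    """activations: (n_examples, hidden_dim) pooled neuron activations.
    bias_values: (n_examples,) bias attribute per example (e.g., length).
    Returns indices of the `fraction` of neurons most correlated with
    the bias attribute."""
    acts = activations - activations.mean(axis=0)
    bias = bias_values - bias_values.mean()
    # Pearson correlation of each neuron's activation with the bias
    # attribute, computed per hidden dimension.
    corr = (acts * bias[:, None]).mean(axis=0) / (
        acts.std(axis=0) * bias.std() + 1e-8
    )
    k = max(1, int(fraction * activations.shape[1]))
    return np.argsort(-np.abs(corr))[:k]
```

Taking the absolute correlation flags neurons that track the bias in either direction; the article does not specify the sign handling, so that is a design choice of this sketch.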

Entities

Institutions

  • arXiv

Sources

  • arXiv:2604.27495