reward-lens: Open-Source Library for Reward Model Interpretability
Researchers have released reward-lens, an open-source library that brings mechanistic interpretability tools to the reward models used in reinforcement learning from human feedback (RLHF). Established methods such as the logit lens and activation patching were designed for generative language models, which map hidden states to a vocabulary through an unembedding matrix; they do not transfer directly to reward models, which end in a scalar regression head. reward-lens adapts these methods for that setting, organizing them around the reward head's weight vector as the natural axis of interpretation. The library includes a Reward Lens, component attribution, and a suite for probing reward hacking, among other tools. The accompanying paper is available on arXiv under ID 2604.26130.
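The core idea mirrors the logit lens: instead of projecting intermediate residual streams through an unembedding matrix, you project them onto the reward head's weight vector, reading off a scalar reward estimate at every layer and position. Below is a minimal sketch of that projection, assuming a head of the form r = w·h + b and ignoring any final layer norm; the function name and signature are illustrative, not the library's actual API.

```python
import torch

def reward_lens(hidden_states: list[torch.Tensor], w_reward: torch.Tensor,
                b_reward: float = 0.0) -> torch.Tensor:
    """Read a scalar reward estimate off every layer and position by
    projecting the residual stream onto the reward head's weight vector
    (the analogue of the logit lens for a scalar regression head).

    hidden_states: one (seq_len, d_model) tensor per layer
    w_reward:      (d_model,) weight vector of the scalar head
    """
    # At the final layer and final position this recovers the model's
    # actual reward output r = w . h + b (final layer norm ignored here).
    return torch.stack([h @ w_reward + b_reward for h in hidden_states])
```

Plotting the resulting (n_layers, seq_len) grid shows where in depth and sequence position the reward estimate forms, which is the kind of view the library's Reward Lens appears designed to provide.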
Key facts
- reward-lens is an open-source library for mechanistic interpretability of reward models.
- It adapts tools like logit lens, activation patching, and sparse autoencoders to reward models.
- Reward models use a scalar regression head instead of vocabulary unembedding.
- The library is organized around the reward head's weight vector as the natural interpretability axis.
- Includes a Reward Lens, component attribution (see the sketch after this list), three-mode activation patching, and a reward-hacking probe suite.
- Offers TopK SAE feature attribution and cross-model comparison.
- Provides five theory-grounded extensions: distortion index, divergence-aware patching, misalignment cascade detection, reward-term conflict analysis, and concept-vector analysis.
- The paper is published on arXiv with ID 2604.26130.
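Component attribution follows from the same linearity: the final residual stream is a sum of component outputs (token embeddings, attention heads, MLP layers), and the reward is a linear readout of that sum, so each component's direct contribution to the reward is just its dot product with the head's weight vector. Here is a sketch under those assumptions; the dict-based interface and names are hypothetical, and a final layer norm would break the exact decomposition in practice.

```python
import torch

def attribute_reward(component_outputs: dict[str, torch.Tensor],
                     w_reward: torch.Tensor) -> dict[str, float]:
    """Direct reward attribution: with reward r = w . h and the final
    residual stream h a sum of per-component outputs, each component's
    direct contribution is w . (its output at the scored position)."""
    scores = {name: float(out @ w_reward)
              for name, out in component_outputs.items()}
    # Rank components by the magnitude of their contribution.
    return dict(sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True))

# Toy usage with random vectors standing in for cached component outputs.
d_model = 16
w = torch.randn(d_model)
components = {"embed": torch.randn(d_model),
              "attn_head_3.2": torch.randn(d_model),
              "mlp_5": torch.randn(d_model)}
print(attribute_reward(components, w))
```

The same readout vector can serve the other listed tools, e.g. scoring SAE features or concept vectors by their projection onto it, which is presumably why the library treats it as the organizing axis.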
Entities
Institutions
- arXiv