ProFIL: Probe-Filtered RL Reduces Reasoning Theater in LLMs
Researchers have introduced ProFIL (Probe-Filtered Reinforcement Learning), a method for reducing "reasoning theater" in large language models: post-hoc rationalizations that look deliberative but contribute nothing to correctness, wasting tokens and obscuring interpretability. ProFIL extends Group Relative Policy Optimization (GRPO) with a multi-head attention probe, trained once on a frozen base model, that detects post-commitment reasoning steps from internal activations. During GRPO training, any rollout whose probe score exceeds a threshold has its advantage zeroed, suppressing theater while preserving faithfulness. The probe is supervised with verifier-derived labels, requiring no human annotation. The result is a single drop-in extension that shortens chains of thought and increases their faithfulness.
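The filtering step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the threshold value, and the example probe scores are all hypothetical, and GRPO details such as per-token credit assignment and KL regularization are omitted.

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantage: reward minus the group mean, over the group std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1e-6  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

def probe_filter(advantages, probe_scores, threshold=0.5):
    """Zero the advantage of any rollout whose probe 'theater' score exceeds
    the threshold, so flagged rollouts neither gain nor lose reward signal."""
    return [a if s <= threshold else 0.0
            for a, s in zip(advantages, probe_scores)]

# Four rollouts for one prompt: binary rewards from a verifier, and
# hypothetical probe scores estimating post-commitment content.
rewards = [1.0, 0.0, 1.0, 0.0]
probe_scores = [0.2, 0.9, 0.6, 0.1]
adv = probe_filter(grpo_advantages(rewards), probe_scores)
# adv == [1.0, 0.0, 0.0, -1.0]: rollouts 2 and 3 are filtered out.
```

Note that filtering is one-sided in effect only through the advantage: a flagged correct rollout simply stops contributing gradient, rather than being penalized outright.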
Key facts
- ProFIL stands for Probe-Filtered Reinforcement Learning.
- It targets "reasoning theater" in chain-of-thought reasoning.
- A multi-head attention probe is trained once on a frozen base model.
- The probe detects post-commitment steps from internal activations.
- Rollouts exceeding a probe threshold have their advantage zeroed during GRPO.
- Verifier-derived labels are used without human annotation.
- The method reduces chain-of-thought length and increases faithfulness.
- It is a drop-in extension to Group Relative Policy Optimization (GRPO).
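The probe itself, per the facts above, is a multi-head attention probe reading frozen activations. A plausible forward pass is sketched below; every shape, weight name, and the identity value projection are assumptions for illustration. Training (binary cross-entropy against verifier-derived post-commitment labels) is not shown.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_probe(h, W_q, W_k, W_o, n_heads=4):
    """Score a rollout from frozen activations h of shape (T, d),
    one row per reasoning step. Each head has a learned query (W_q)
    that attends over projected steps (W_k); pooled head outputs feed
    a linear scorer (W_o) producing a 'post-commitment' probability."""
    T, d = h.shape
    dh = d // n_heads
    keys = (h @ W_k).reshape(T, n_heads, dh)            # (T, H, dh)
    attn = softmax(np.einsum('hd,thd->ht', W_q, keys))  # (H, T) weights
    pooled = np.einsum('ht,thd->hd', attn, h.reshape(T, n_heads, dh))
    return 1 / (1 + np.exp(-(pooled.reshape(-1) @ W_o)))  # sigmoid score

# Random weights and activations just to exercise the shapes.
rng = np.random.default_rng(0)
d, T, H = 16, 6, 4
h = rng.normal(size=(T, d))
score = attention_probe(h, rng.normal(size=(H, d // H)),
                        rng.normal(size=(d, d)), rng.normal(size=(d,)))
# score is a probability in (0, 1), compared against the filter threshold.
```

Because the base model is frozen and the probe is trained once, activations can be cached, keeping the added cost of scoring rollouts during RL small.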
Entities
Institutions
- arXiv