Inverse Reinforcement Learning for Reasoning Rewards in LLMs
A framework based on adversarial inverse reinforcement learning (AIRL) has been introduced to derive reasoning rewards for large language models (LLMs) directly from expert demonstrations, addressing shortcomings of supervised fine-tuning (SFT) and outcome-based reinforcement learning (RL). The method evaluates rewards at several granularities: sparse rewards capture overall trajectory quality and train stably, interval rewards sit in between, and dense rewards give step-by-step guidance for pinpointing errors but are harder to optimize. The learned rewards serve as effective training signals, frequently surpassing outcome-based RL. The framework is detailed in the paper arXiv:2510.01857v3.
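To make the reward-learning idea concrete, below is a minimal PyTorch-style sketch of an AIRL-shaped discriminator over reasoning steps: a learned scoring head f(s, a) is trained to separate expert reasoning steps from policy-generated ones, and its output is then reused as the reward. The class name `AIRLRewardModel`, the embedding-based reward head, and the loss helper are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class AIRLRewardModel(nn.Module):
    """Hypothetical AIRL-style reward head over reasoning-step embeddings.

    f(s, a) scores a (prefix, step) pair; the AIRL discriminator is
    D = exp(f) / (exp(f) + pi(a|s)), so logit(D) = f - log pi(a|s).
    """

    def __init__(self, hidden_dim: int = 768):
        super().__init__()
        self.f = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, step_embeddings: torch.Tensor) -> torch.Tensor:
        # step_embeddings: (batch, hidden_dim) features of a reasoning step in context
        return self.f(step_embeddings).squeeze(-1)


def airl_discriminator_loss(f_vals: torch.Tensor,
                            log_pi: torch.Tensor,
                            is_expert: torch.Tensor) -> torch.Tensor:
    """Binary cross-entropy on D = sigmoid(f - log_pi).

    Expert steps are labeled 1, policy-sampled steps 0; the trained f
    (shaping terms omitted here) is then used as the reasoning reward.
    """
    logits = f_vals - log_pi  # equals log D - log(1 - D)
    return nn.functional.binary_cross_entropy_with_logits(logits, is_expert.float())
```

In a full loop, discriminator updates with this loss would alternate with policy updates that maximize the learned reward, as in standard adversarial IRL training.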
Key facts
- Proposes adversarial inverse reinforcement learning (AIRL) for reasoning rewards.
- Learns rewards from expert demonstrations, not outcome-level verifiers.
- Evaluates sparse, interval, and dense reward granularities (see the sketch after this list).
- Sparse rewards focus on global trajectory quality and stability.
- Dense rewards offer step-level supervision but are harder to optimize.
- Learned rewards are useful as training signals.
- Outperforms outcome-based RL in many cases.
- Paper available on arXiv with ID 2510.01857v3.
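To illustrate the three granularities, the sketch below shows one hypothetical way per-step scores from a learned reward model could be aggregated into sparse, interval, or dense signals over a reasoning trajectory; the function `assign_rewards` and its `interval` parameter are invented for illustration and are not taken from the paper.

```python
from typing import List

def assign_rewards(step_scores: List[float], granularity: str, interval: int = 4) -> List[float]:
    """Illustrative mapping of learned per-step scores to reward signals."""
    n = len(step_scores)
    if granularity == "sparse":
        # One trajectory-level reward, delivered at the final step only.
        return [0.0] * (n - 1) + [sum(step_scores) / n]
    if granularity == "interval":
        # Aggregate reward at the end of each fixed-size block of steps.
        rewards = [0.0] * n
        for end in range(interval - 1, n, interval):
            block = step_scores[end - interval + 1 : end + 1]
            rewards[end] = sum(block) / len(block)
        return rewards
    if granularity == "dense":
        # Per-step reward gives step-level supervision for pinpointing errors.
        return list(step_scores)
    raise ValueError(f"unknown granularity: {granularity}")
```

Sparse assignment keeps the optimization target close to trajectory quality, while the dense variant exposes every step to a separate reward, which matches the trade-off described above.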