Inverse Reinforcement Learning for Reasoning Rewards in LLMs
A framework based on adversarial inverse reinforcement learning (AIRL) has been introduced to derive reasoning rewards for large language models (LLMs) directly from expert demonstrations, addressing shortcomings of supervised fine-tuning (SFT) and outcome-based reinforcement learning (RL). The method evaluates rewards at several granularities: sparse rewards capture overall trajectory quality and train stably, interval rewards sit in between, and dense rewards give step-by-step guidance for pinpointing errors but are harder to optimize. The learned rewards serve as effective training signals, frequently surpassing outcome-based RL. The framework is detailed in the paper arXiv:2510.01857v3.
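To make the reward-learning idea concrete, below is a minimal PyTorch-style sketch of an AIRL-shaped discriminator over reasoning steps: a learned scoring head f(s, a) is trained to separate expert reasoning steps from policy-generated ones, and its output is then reused as the reward. The class name `AIRLRewardModel`, the embedding-based reward head, and the loss helper are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class AIRLRewardModel(nn.Module):
    """Hypothetical AIRL-style reward head over reasoning-step embeddings.

    f(s, a) scores a (prefix, step) pair; the AIRL discriminator is
    D = exp(f) / (exp(f) + pi(a|s)), so logit(D) = f - log pi(a|s).
    """

    def __init__(self, hidden_dim: int = 768):
        super().__init__()
        self.f = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, step_embeddings: torch.Tensor) -> torch.Tensor:
        # step_embeddings: (batch, hidden_dim) features of a reasoning step in context
        return self.f(step_embeddings).squeeze(-1)


def airl_discriminator_loss(f_vals: torch.Tensor,
                            log_pi: torch.Tensor,
                            is_expert: torch.Tensor) -> torch.Tensor:
    """Binary cross-entropy on D = sigmoid(f - log_pi).

    Expert steps are labeled 1, policy-sampled steps 0; the trained f
    (shaping terms omitted here) is then used as the reasoning reward.
    """
    logits = f_vals - log_pi  # equals log D - log(1 - D)
    return nn.functional.binary_cross_entropy_with_logits(logits, is_expert.float())
```

In a full loop, discriminator updates with this loss would alternate with policy updates that maximize the learned reward, as in standard adversarial IRL training.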
Key facts
- Proposes adversarial inverse reinforcement learning (AIRL) for reasoning rewards.
- Learns rewards from expert demonstrations, not outcome-level verifiers.
- Evaluates sparse, interval, and dense reward granularities (see the sketch after this list).
- Sparse rewards focus on global trajectory quality and stability.
- Dense rewards offer step-level supervision but are harder to optimize.
- Learned rewards are useful as training signals.
- Outperforms outcome-based RL in many cases.
- Paper available on arXiv with ID 2510.01857v3.
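To illustrate the three granularities, the sketch below shows one hypothetical way per-step scores from a learned reward model could be aggregated into sparse, interval, or dense signals over a reasoning trajectory; the function `assign_rewards` and its `interval` parameter are invented for illustration and are not taken from the paper.

```python
from typing import List

def assign_rewards(step_scores: List[float], granularity: str, interval: int = 4) -> List[float]:
    """Illustrative mapping of learned per-step scores to reward signals."""
    n = len(step_scores)
    if granularity == "sparse":
        # One trajectory-level reward, delivered at the final step only.
        return [0.0] * (n - 1) + [sum(step_scores) / n]
    if granularity == "interval":
        # Aggregate reward at the end of each fixed-size block of steps.
        rewards = [0.0] * n
        for end in range(interval - 1, n, interval):
            block = step_scores[end - interval + 1 : end + 1]
            rewards[end] = sum(block) / len(block)
        return rewards
    if granularity == "dense":
        # Per-step reward gives step-level supervision for pinpointing errors.
        return list(step_scores)
    raise ValueError(f"unknown granularity: {granularity}")
```

Sparse assignment keeps the optimization target close to trajectory quality, while the dense variant exposes every step to a separate reward, which matches the trade-off described above.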