ICRL: Joint Training of Solver and Critic via Reinforcement Learning

ai-technology · 2026-05-18

The ICRL framework (Learning to Internalize Self-Critique with Reinforcement Learning) jointly trains a solver and a critic from a shared backbone, so that successes the solver achieves only with the critic's help become abilities it retains without assistance. The critic is rewarded according to the solver's subsequent performance gain, which encourages constructive feedback. To address the distribution shift between critique-conditioned and critique-free behavior, ICRL introduces a distribution-calibration re-weighting ratio. The aim is for agents built on large language models to internalize critique guidance so that they need no external feedback at test time.
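The critic's reward signal can be illustrated with a minimal sketch. The summary only says the reward is tied to the solver's performance gain, so the exact form below (score after critique minus score before) is an assumption, and the names `Attempt` and `critic_reward` are hypothetical rather than taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    """Outcome of one solver attempt; `reward` is a task score, e.g. in [0, 1]."""
    reward: float


def critic_reward(before: Attempt, after: Attempt) -> float:
    """Assumed critic reward: the solver's score change after reading the critique.

    Positive when the critique helped, zero when it changed nothing,
    negative when it made the solver worse.
    """
    return after.reward - before.reward


# Example: the solver fails at first, then succeeds after the critique.
first_try = Attempt(reward=0.0)
revised_try = Attempt(reward=1.0)
gain = critic_reward(first_try, revised_try)
```

Under this assumed form, a critic that merely restates an already correct answer earns nothing, which is one way a gain-based reward could favor feedback that actually changes the outcome.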

Key facts

ICRL stands for Learning to Internalize Self-Critique with Reinforcement Learning
The framework jointly trains a solver and a critic from a shared backbone
The critic is rewarded based on the solver's subsequent performance gain
ICRL introduces a distribution-calibration re-weighting ratio
The approach addresses distribution shift between critique-conditioned and critique-free behavior
The goal is to convert critique-induced success into unassisted solver ability
The paper is available on arXiv with ID 2605.15224
The publication date is not specified in the abstract
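The summary does not give the formula for the distribution-calibration re-weighting ratio. As a rough illustration of the general idea only, the sketch below uses a generic clipped importance ratio between the critique-free and critique-conditioned probabilities of the same response. This is a standard off-policy correction swapped in for illustration, not necessarily ICRL's ratio; the function names, the clipping, and the `clip` value are all assumptions.

```python
import math


def reweight_ratio(logp_free: float, logp_critique: float, clip: float = 5.0) -> float:
    """Generic importance ratio pi(y | x) / pi(y | x, critique), clipped to [1/clip, clip].

    logp_free:     log-probability of response y without the critique in context.
    logp_critique: log-probability of the same y with the critique in context.
    The clipping is an assumption, added to keep the weights bounded.
    """
    ratio = math.exp(logp_free - logp_critique)
    return max(1.0 / clip, min(clip, ratio))


def calibrated_advantages(samples, clip: float = 5.0):
    """Scale each advantage by its ratio.

    `samples` is a list of (logp_free, logp_critique, advantage) tuples
    for responses generated with the critique in context.
    """
    return [reweight_ratio(lf, lc, clip) * adv for lf, lc, adv in samples]
```

Responses that the solver would almost never produce without the critique get a small weight, so they contribute less when the critique-free policy is updated.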
