RLAAR: Curriculum RL Reduces Lost-in-Conversation in LLMs
A new framework called RLAAR (Reinforcement Learning with Verifiable Accuracy and Abstention Rewards) addresses the Lost-in-Conversation (LiC) problem in large language models, where performance degrades when task information is revealed incrementally across multiple turns rather than all at once. The approach combines a competence-gated curriculum, which gradually increases dialogue difficulty, with a mixed-reward system that rewards both correct answers and informed abstention when a question is not yet solvable from the information revealed so far. RLAAR uses multi-turn on-policy rollouts to train models to balance problem-solving against abstention, reducing premature answering. The work is motivated by progress in Reinforcement Learning with Verifiable Rewards (RLVR) and aims to improve reliability in multi-turn conversations.
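The summary describes a mixed reward over two behaviors: answering (scored by verifiable accuracy) and abstaining (rewarded only when the question is genuinely unsolvable at that point in the dialogue). The paper's actual reward magnitudes are not given here, so the following sketch uses illustrative constants; the function name and signature are assumptions, not the authors' API.

```python
def mixed_reward(action: str, correct: bool = False, solvable: bool = True) -> float:
    """Illustrative mixed reward for one turn.

    action   : "answer" or "abstain"
    correct  : whether an answer matched the verifiable ground truth
    solvable : whether the question was answerable from the turns seen so far
    All reward values below are assumed for illustration.
    """
    if action == "abstain":
        # Informed abstention on an unsolvable question is rewarded;
        # abstaining when the answer was derivable is mildly penalized.
        return 0.5 if not solvable else -0.2
    # Verifiable-accuracy term: premature or wrong answers are penalized,
    # which discourages guessing before enough information has been revealed.
    return 1.0 if correct else -1.0
```

Under this shaping, a policy maximizes return by answering only once the accumulated turns make the answer verifiable, and abstaining otherwise.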
Key facts
- RLAAR stands for Reinforcement Learning with Verifiable Accuracy and Abstention Rewards.
- It addresses Lost-in-Conversation (LiC) in large language models.
- The framework uses a competence-gated curriculum that incrementally increases dialogue difficulty.
- It employs a mixed-reward system for correct answers and abstention.
- Multi-turn on-policy rollouts are used for training.
- The goal is to reduce premature answering behavior.
- The work is motivated by progress in RLVR.
- The paper is available on arXiv with ID 2510.18731.
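The competence-gated curriculum listed above can be sketched as a controller that lengthens training dialogues only after the model demonstrates competence at the current difficulty. The gating rule, window size, and threshold below are assumptions for illustration; the source does not specify the paper's actual schedule.

```python
from collections import deque


class CompetenceGatedCurriculum:
    """Hypothetical sketch: promote to longer multi-turn dialogues only
    when recent rollout success at the current difficulty clears a threshold."""

    def __init__(self, max_turns: int = 8, threshold: float = 0.7, window: int = 100):
        self.turns = 2                        # start with short dialogues (assumed)
        self.max_turns = max_turns
        self.threshold = threshold
        self.recent = deque(maxlen=window)    # rolling record of rollout outcomes

    def record(self, success: bool) -> None:
        self.recent.append(success)
        # Gate: increase dialogue length only once the window is full and
        # the success rate shows competence at the current difficulty.
        if (len(self.recent) == self.recent.maxlen
                and sum(self.recent) / len(self.recent) >= self.threshold
                and self.turns < self.max_turns):
            self.turns += 1
            self.recent.clear()               # re-measure at the new difficulty
```

Each on-policy rollout would then be generated with `curriculum.turns` information-revealing turns, so difficulty rises only as fast as the model earns it.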