ARTFEED — Contemporary Art Intelligence

TraceLift: Training Reasoning Planners with Executor-Grounded Rewards

ai-technology · 2026-05-07

TraceLift is a reinforcement learning framework that addresses a shortcoming of rewarding only the correctness of final answers when training large language models to reason: outcome-only signals leave intermediate reasoning unshaped. The method treats reasoning as a consumable artifact. In its planner-executor setup, a planner generates tagged reasoning traces, a frozen executor consumes them to produce a final answer, and that answer is verified to provide feedback. The executor-grounded reward multiplies a rubric-based Reasoning Reward Model (RM) score by the uplift the trace produces on that same frozen executor, so traces are rewarded for actually improving executor performance. The goal is faithful, reliable reasoning that benefits downstream models while avoiding shortcuts and flawed intermediate states. The paper is on arXiv with ID 2605.03862.
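To make the reward concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the paper's implementation: the rubric score range, the uplift definition (executor accuracy with the trace minus accuracy without it), and the clipping at zero are all hypothetical, as are the function and parameter names.

    # Hypothetical sketch of an executor-grounded reward: a rubric-based
    # Reasoning RM score multiplied by the uplift a trace produces on a
    # frozen executor. Names and the exact formula are assumptions.

    def executor_grounded_reward(rubric_score: float,
                                 acc_with_trace: float,
                                 acc_without_trace: float) -> float:
        """Reward = RM score x executor uplift (assumed form)."""
        # Uplift: the frozen executor's accuracy when given the planner's
        # trace, minus its baseline accuracy with no trace (assumed definition).
        uplift = acc_with_trace - acc_without_trace
        # Clip negative uplift to zero so harmful traces earn no reward
        # (an assumption; the paper may handle this differently).
        return rubric_score * max(uplift, 0.0)

    # Example with made-up numbers: 0.8 * (0.65 - 0.50) ≈ 0.12
    print(executor_grounded_reward(0.8, 0.65, 0.50))

Under this multiplicative form, a trace that scores well on the rubric but does not improve the executor earns no reward, which is how the scheme would discourage plausible-looking but unhelpful reasoning.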

Key facts

  • TraceLift is a planner-executor training framework for large language models.
  • It uses executor-grounded rewards to shape intermediate reasoning traces.
  • The reward multiplies a rubric-based Reasoning Reward Model score by uplift on a frozen executor.
  • The approach treats reasoning as a consumable artifact for downstream models.
  • It aims to produce faithful and reliable reasoning traces.
  • The paper is on arXiv with ID 2605.03862.
  • The method addresses limitations of outcome-only reward signals.
  • It avoids shortcuts and flawed intermediate states in multi-step systems.

Entities

Institutions

  • arXiv
