PTA-GRPO Enhances LLM Reasoning with High-Level Planning
Researchers have introduced Plan-Then-Action Enhanced Reasoning with Group Relative Policy Optimization (PTA-GRPO), a framework for improving reasoning in large language models (LLMs). It operates in two stages. First, it condenses Chain-of-Thought (CoT) reasoning into concise, high-level guidance that is used for supervised fine-tuning. Second, it applies guidance-aware reinforcement learning (RL) to jointly optimize the final output. The approach targets two problems: the reliance on token-level local decisions, and the high computational cost of tree-based search and reinforcement learning techniques.
Key facts
- PTA-GRPO is a two-stage framework for LLM reasoning
- Stage 1: Summarizes CoT into high-level guidance for supervised fine-tuning
- Stage 2: Guidance-aware RL jointly optimizes the final output
- Addresses token-level local decisions in LLMs
- Reduces computational costs compared to tree-based search and RL
- Published on arXiv: 2510.01833v2
- Focuses on improving reasoning trajectories
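The "group relative" part of the name refers to how GRPO-style methods score sampled responses: each response's reward is compared against the other responses drawn for the same prompt, rather than against a learned value function. A minimal Python sketch of that normalization follows. The summary gives no reward formula for the guidance-aware stage, so `combined_reward`, its `guidance_weight`, and the sample scores are hypothetical illustrations, not the paper's definition; the use of the population standard deviation is likewise an assumption.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]


def combined_reward(answer_correct, guidance_score, guidance_weight=0.5):
    """Hypothetical reward mixing answer correctness with a guidance-quality score."""
    return float(answer_correct) + guidance_weight * guidance_score


# Four sampled responses to one prompt: (answer correct?, guidance quality in [0, 1]).
samples = [(True, 0.9), (True, 0.4), (False, 0.7), (False, 0.1)]
rewards = [combined_reward(correct, score) for correct, score in samples]
advantages = group_relative_advantages(rewards)

for (correct, score), adv in zip(samples, advantages):
    print(f"correct={correct} guidance={score:.1f} advantage={adv:+.3f}")
```

Responses that beat their group's average get a positive advantage and are reinforced; those below it get a negative one. If every response in a group earns the same reward, all advantages are zero and that group contributes no learning signal.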