PTA-GRPO Enhances LLM Reasoning with High-Level Planning

ai-technology · 2026-05-27

Researchers have introduced Plan-Then-Action Enhanced Reasoning with Group Relative Policy Optimization (PTA-GRPO), a two-stage framework for improving reasoning in large language models (LLMs). In the first stage, Chain-of-Thought (CoT) reasoning is condensed into concise, high-level guidance that is used for supervised fine-tuning. In the second stage, guidance-aware reinforcement learning jointly optimizes the final output. The approach targets two problems: the tendency of LLMs to make token-level local decisions, and the high computational cost of tree-based search and reinforcement learning techniques.
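For readers unfamiliar with the "Group Relative" part of the name, the following is a minimal sketch of the group-relative advantage used in standard GRPO: several responses are sampled for the same prompt, and each response's reward is standardized against the mean and standard deviation of its own group. This illustrates generic GRPO only, not the paper's guidance-aware variant, whose reward design this summary does not describe. The function name, the reward values, the epsilon term, and the choice of population standard deviation are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses to the same prompt.

    Assumption: population standard deviation plus a small epsilon to avoid
    division by zero; real implementations may differ on both points.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four responses sampled for one prompt
# (for example, 1.0 for a correct final answer and 0.0 otherwise).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# Responses that beat the group average get a positive advantage,
# the rest get a negative one.
for reward, advantage in zip(rewards, advantages):
    print(f"reward={reward:.1f} advantage={advantage:+.3f}")
```

Because the baseline comes from the group itself, this style of training needs no separate value model, which is one reason GRPO is generally regarded as cheaper than critic-based RL methods.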

Key facts

PTA-GRPO is a two-stage framework for LLM reasoning
Stage 1: Summarizes CoT into high-level guidance for supervised fine-tuning
Stage 2: Guidance-aware RL jointly optimizes final output
Addresses token-level local decisions in LLMs
Targets the high computational cost of tree-based search and RL methods
Published on arXiv: 2510.01833v2
Focuses on improving reasoning trajectories

Sources

arXiv cs.AI — 2026-05-27