ARTFEED — Contemporary Art Intelligence

PTA-GRPO Enhances LLM Reasoning with High-Level Planning

ai-technology · 2026-05-27

Researchers have introduced Plan-Then-Action Enhanced Reasoning with Group Relative Policy Optimization (PTA-GRPO), a framework for improving reasoning in large language models (LLMs). It works in two stages. First, it condenses Chain-of-Thought (CoT) reasoning into concise, high-level guidance that is used for supervised fine-tuning. Second, it applies guidance-aware reinforcement learning to jointly optimize the final output and the quality of that guidance. The approach targets two weaknesses of existing methods: token-by-token generation that makes only local decisions, and the high computational cost of tree-based search and reinforcement learning techniques.

Key facts

  • PTA-GRPO is a two-stage framework for LLM reasoning
  • Stage 1: Summarizes CoT into high-level guidance for supervised fine-tuning
  • Stage 2: Guidance-aware RL jointly optimizes the final output and the quality of the high-level guidance
  • Addresses token-level local decisions in LLMs
  • Targets the high computational cost of existing tree-based search and RL approaches
  • Published on arXiv: 2510.01833v2
  • Focuses on improving reasoning trajectories
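
Stage 2 builds on Group Relative Policy Optimization, which scores each sampled response against the other responses drawn for the same prompt instead of relying on a separate value model. The sketch below illustrates that group-relative normalization only. The `combined_reward` helper and its `guidance_weight` parameter are hypothetical: they show one way a guidance-quality score might be folded into the reward and are not taken from the paper.

```python
from statistics import mean, pstdev


def combined_reward(outcome_reward, guidance_score, guidance_weight=0.5):
    """Hypothetical reward mixing answer correctness with guidance quality.

    The additive form and the default weight are assumptions made for
    illustration; the paper's actual reward design may differ.
    """
    return outcome_reward + guidance_weight * guidance_score


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps). The
    population standard deviation is used here; that choice is an assumption.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four sampled responses for one prompt: (answer correct?, guidance score).
    samples = [(1.0, 0.8), (0.0, 0.6), (1.0, 0.2), (0.0, 0.4)]
    rewards = [combined_reward(outcome, guidance) for outcome, guidance in samples]
    for reward, advantage in zip(rewards, group_relative_advantages(rewards)):
        print(f"reward={reward:.2f} advantage={advantage:+.3f}")
```

Responses that beat the group average receive a positive advantage and are reinforced; those below it are discouraged. When every response in a group earns the same reward, all advantages are zero and the group contributes no learning signal.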
