A* Post-Training Boosts LLM Reasoning Efficiency

ai-technology · 2026-05-26

A recent study published on arXiv (2605.24597) proposes improving deductive reasoning in large language models (LLMs) through post-training based on A* search. The researchers frame natural language inference as a search problem in which the answer is a valid proof, so every intermediate step must be correct. They investigate supervised fine-tuning on A* execution traces as well as reinforcement learning with A*-informed process reward models. In tests on Llama-3.2 models (1B–3B parameters), accuracy rises from nearly zero to a level that surpasses DeepSeek-V3.2, a considerably larger model. The findings also point to a trade-off between simple correctness rewards and efficiency.
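
To make the search framing concrete, here is a minimal sketch of A* over a toy forward-chaining proof problem. It is not the paper's implementation: the function name `astar_proof`, the rule format, the unit cost per rule application, and the simple heuristic (one more step is needed unless the goal is already derived) are all assumptions chosen for illustration.

```python
import heapq
from itertools import count


def astar_proof(facts, rules, goal):
    """Return a shortest list of rule applications that derives `goal`, or None.

    Hypothetical setup: a state is the set of facts known so far, and each
    rule is a pair (premises, conclusion). Every rule application costs 1.
    """
    def h(state):
        # Admissible heuristic (an assumption for this sketch): at least one
        # more step is needed unless the goal has already been derived.
        return 0 if goal in state else 1

    start = frozenset(facts)
    tie = count()  # tie-breaker so the heap never has to compare states
    frontier = [(h(start), next(tie), 0, start, [])]
    best_cost = {start: 0}

    while frontier:
        _, _, g, state, proof = heapq.heappop(frontier)
        if goal in state:
            return proof
        if g > best_cost.get(state, float("inf")):
            continue  # stale queue entry; a cheaper route was found already
        for premises, conclusion in rules:
            # Skip rules that add nothing new or whose premises are not met.
            if conclusion in state or not set(premises) <= state:
                continue
            nxt = state | {conclusion}
            if g + 1 < best_cost.get(nxt, float("inf")):
                best_cost[nxt] = g + 1
                step = (tuple(premises), conclusion)
                heapq.heappush(
                    frontier,
                    (g + 1 + h(nxt), next(tie), g + 1, nxt, proof + [step]),
                )
    return None


# Toy knowledge base (invented for this example).
rules = [
    (("rain",), "wet_ground"),
    (("wet_ground",), "slippery"),
    (("rain",), "clouds"),
    (("rain", "clouds"), "dark"),
]
for premises, conclusion in astar_proof({"rain"}, rules, "slippery"):
    print(" & ".join(premises), "=>", conclusion)
```

The returned list of steps is the kind of intermediate-step record that a search trace could expose. How the paper actually serializes A* execution traces for fine-tuning is not described in this summary.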

Key facts

Paper arXiv:2605.24597 proposes A* post-training for LLM reasoning.
Frames natural language inference as a search problem for valid proofs.
Uses supervised fine-tuning on A* execution traces.
Also uses reinforcement learning with A*-informed process reward models.
Llama-3.2 models (1B–3B) improved from near-zero accuracy.
After post-training, they outperformed DeepSeek-V3.2, a much larger model.
Trade-off between correctness rewards and efficiency identified.
A* search is guaranteed to find a least-cost path to the goal when its heuristic is admissible.
