A* Post-Training Boosts LLM Reasoning Efficiency
A recent study published on arXiv (2605.24597) proposes improving deductive reasoning in large language models (LLMs) with A* search. The researchers frame natural language inference as a search problem: the answer is a valid proof, so every intermediate step must be correct. They study two post-training approaches: supervised fine-tuning on A* execution traces, and reinforcement learning with A*-informed process reward models. In experiments on Llama-3.2 models (1B–3B parameters), accuracy rises from near zero to a level that surpasses DeepSeek-V3.2, a considerably larger model. The findings also point to a trade-off between simple correctness rewards and efficiency.
Key facts
- Paper arXiv:2605.24597 proposes A* post-training for LLM reasoning.
- Frames natural language inference as a search problem for valid proofs.
- Uses supervised fine-tuning on A* execution traces.
- Also uses reinforcement learning with A*-informed process reward models.
- Llama-3.2 models (1B–3B) improved from near-zero accuracy.
- Outperformed DeepSeek-V3.2, a much larger model.
- Trade-off between correctness rewards and efficiency identified.
- With an admissible heuristic, A* search is guaranteed to find a lowest-cost path to the goal.
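For readers unfamiliar with A*, the following is a minimal sketch of proof search framed as A* over sets of derived facts. It is not the paper's implementation: the rule base `RULES`, the unit cost per inference step, the 0/1 heuristic, and the function name `astar_proof` are all assumptions chosen for illustration. The recorded `trace` (states in expansion order) only hints at the kind of execution trace that could be serialized for fine-tuning.

```python
import heapq
from itertools import count

# Hypothetical toy rule base (not from the paper): (premises, conclusion).
RULES = [
    (frozenset({"A"}), "B"),
    (frozenset({"B"}), "C"),
    (frozenset({"A"}), "D"),
    (frozenset({"C", "D"}), "G"),
    (frozenset({"A"}), "E"),  # distractor rule, never needed to prove G
]

def heuristic(facts, goal):
    # Assumed heuristic: 0 at the goal, else 1 (at least one more step is
    # needed). It never overestimates, so it is admissible.
    return 0 if goal in facts else 1

def astar_proof(initial_facts, goal, rules):
    """Return (proof_steps, trace), or (None, trace) if the goal is unreachable."""
    start = frozenset(initial_facts)
    tie = count()  # tie-breaker so the heap never compares fact sets
    frontier = [(heuristic(start, goal), next(tie), 0, start, [])]
    best_g = {start: 0}  # cheapest known cost to reach each state
    trace = []           # expanded states, in order: (facts, g, f)
    while frontier:
        f, _, g, facts, proof = heapq.heappop(frontier)
        if g > best_g.get(facts, float("inf")):
            continue  # stale queue entry; a cheaper route was found later
        trace.append((sorted(facts), g, f))
        if goal in facts:
            return proof, trace
        for premises, conclusion in rules:
            if premises <= facts and conclusion not in facts:
                nxt = facts | {conclusion}
                ng = g + 1  # assumed unit cost per inference step
                if ng < best_g.get(nxt, float("inf")):
                    best_g[nxt] = ng
                    step = f"{' & '.join(sorted(premises))} -> {conclusion}"
                    heapq.heappush(
                        frontier,
                        (ng + heuristic(nxt, goal), next(tie), ng, nxt, proof + [step]),
                    )
    return None, trace

if __name__ == "__main__":
    proof, trace = astar_proof({"A"}, "G", RULES)
    print("proof:", proof)
    print("states expanded:", len(trace))
```

With such a weak heuristic the search behaves almost like uniform-cost search; a more informative admissible heuristic would expand fewer states while still returning a shortest proof.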
Entities
Institutions
- arXiv
- DeepSeek