Verbal Process Supervision Boosts LLM Reasoning Without Training
Verbal Process Supervision (VPS) is a training-free method that improves large language model reasoning by injecting structured natural-language critiques from a stronger supervisor model. On the GPQA Diamond benchmark, GPT-5.4 reaches 94.9% accuracy with a round budget of R=4, surpassing the previous state of the art of 94.1% without any gradient updates. On AIME 2025, VPS lifts weaker models from 11.7–26.7% to 63.3–90.0%, a gain of up to 63.3 points. At matched compute, VPS outperforms Reflexion by up to 12.1 points, and beats Self-Consistency@5 by 5.0 points on GPQA and 8.3 points on LiveCodeBench V6.
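The mechanism described above, a stronger supervisor critiquing a weaker worker over up to R rounds, can be sketched in a few lines. Everything below is an illustrative assumption: the function names, the mock worker/supervisor, and the toy arithmetic problem are stand-ins, not the paper's actual prompts or API.

```python
def worker_solve(problem, critique=None):
    # Stand-in for the weaker "worker" model: returns a candidate solution,
    # optionally revised in light of the supervisor's critique.
    if critique is None:
        return {"answer": 10, "steps": "10 - 4 = 10"}  # deliberately flawed draft
    return {"answer": 6, "steps": "10 - 4 = 6"}        # revised after critique

def supervisor_critique(problem, attempt):
    # Stand-in for the stronger supervisor: inspects the worker's reasoning
    # trace and returns a natural-language critique, or None if the process
    # looks sound (VPS supervises the *process*, not just the final answer).
    if "10 - 4 = 10" in attempt["steps"]:
        return "Arithmetic error: 10 - 4 is 6, not 10. Redo the subtraction."
    return None

def vps(problem, rounds=4):
    # Training-free loop: critique-and-revise for up to R rounds,
    # stopping early once the supervisor accepts the reasoning.
    attempt = worker_solve(problem)
    for _ in range(rounds):
        critique = supervisor_critique(problem, attempt)
        if critique is None:
            break
        attempt = worker_solve(problem, critique=critique)
    return attempt["answer"]

print(vps("What is 10 - 4?", rounds=4))  # -> 6
```

Because the loop exits as soon as the critique is empty, the round budget R=4 is an upper bound on supervisor calls, which is what makes the matched-compute comparisons against Reflexion and Self-Consistency@5 meaningful.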
Key facts
- VPS is a training-free framework using structured natural-language critique from a stronger supervisor.
- On GPQA Diamond, the GPT-5.4 (High) | GPT-5.4 (Low) supervisor–worker pairing achieves 94.9% at R=4, surpassing the 94.1% state of the art.
- On AIME 2025, VPS boosts weak model scores from 11.7–26.7% to 63.3–90.0% (up to +63.3 points).
- At matched compute, VPS outperforms Reflexion by +8.5 to +12.1 points.
- VPS outperforms Self-Consistency@5 by +5.0 pp on GPQA and +8.3 pp on LiveCodeBench V6.
- VPS introduces a fourth axis: granularity of external verbal supervision.
- Results cover GPQA Diamond, AIME 2025, and LiveCodeBench V6.
- VPS works with both closed and open models.
Entities
Institutions
- arXiv
Benchmarks
- GPQA Diamond
- AIME 2025
- LiveCodeBench V6