Adaptive Commitment Depth Boosts Long-Horizon Vision-Language Reasoning
A new arXiv preprint (2605.09860) treats commitment depth, the number of primitive actions a policy executes before replanning, as a learnable, state-conditioned variable within a vision-language policy. On Sliding Puzzle and Sokoban, this adaptive approach outperforms fixed-depth baselines, achieving solve rates up to 12.5 percentage points higher while using about 25% fewer primitive actions per episode. Despite relying on only a 7B backbone, the system outperforms GPT-4o.
Key facts
- Commitment depth is formalized as the number of primitive actions executed open-loop between replans.
- Most existing long-horizon systems fix commitment depth as a hand-designed scalar.
- The proposed method treats commitment depth as a learnable, state-conditioned variable of the policy.
- The adaptive policy is instantiated within a model-native vision-language policy.
- The method Pareto-dominates every non-degenerate fixed-depth baseline on Sliding Puzzle and Sokoban.
- It achieves up to 12.5 percentage points higher solve rate.
- It uses approximately 25% fewer primitive actions per episode.
- Despite using only a 7B backbone, the system outperforms GPT-4o.
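The mechanism in the facts above can be illustrated with a minimal control loop: the policy returns both a plan and a state-conditioned commitment depth, and the executor runs that many actions open-loop before replanning. This is a toy sketch only; `ToyEnv`, `toy_policy`, and the 1-D goal-reaching task are illustrative stand-ins, not the paper's environment or its vision-language model.

```python
# Sketch of an adaptive-commitment-depth control loop.
# ToyEnv and toy_policy are hypothetical stand-ins for illustration,
# not the paper's actual benchmark or VLM policy.

class ToyEnv:
    """1-D goal-reaching task: move from 0 toward `goal` in +/-1 steps."""
    def __init__(self, goal=7):
        self.state, self.goal = 0, goal

    def step(self, action):
        """Apply an action in {-1, +1}; return (state, done)."""
        self.state += action
        return self.state, self.state == self.goal

def toy_policy(state, goal):
    """Return a plan plus a state-conditioned commitment depth.

    Far from the goal it commits to a longer open-loop burst;
    near the goal it commits to a single step, keeping replanning
    responsive where precision matters.
    """
    remaining = goal - state
    direction = 1 if remaining > 0 else -1
    depth = max(1, min(abs(remaining) // 2, 4))  # adaptive, not a fixed scalar
    return [direction] * depth, depth

def rollout(env, max_replans=20):
    """Execute `depth` primitive actions open-loop between replans."""
    actions, replans = 0, 0
    for _ in range(max_replans):
        plan, depth = toy_policy(env.state, env.goal)
        replans += 1
        for a in plan[:depth]:       # open-loop commitment window
            _, done = env.step(a)
            actions += 1
            if done:
                return True, actions, replans
    return False, actions, replans

solved, n_actions, n_replans = rollout(ToyEnv(goal=7))
```

A degenerate fixed depth of 1 would replan after every action (maximally reactive but expensive), while a large fixed depth risks long blind bursts; conditioning depth on the state is what lets the learned policy trade off the two per situation.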