ANCORA: Self-Play Framework for Verifiable Reasoning Without Human Supervision
Researchers propose ANCORA, a framework that shifts the training objective from learning to answer to learning to question. The system alternates between a Proposer, which generates novel specifications, and a Solver, which produces verified solutions, so the model can improve itself without human supervision. Three mechanisms stabilize training: a two-level group-relative update that couples advantages, iterative self-distilled SFT that projects onto a valid-output manifold, and a UCB-guided Curriculum DAG that grows only through verified specifications. Together these stabilizers prevent the Proposer from collapsing under sparse verifier feedback. The work is described in arXiv:2604.27644.
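As a rough illustration of how a UCB-guided curriculum that grows only through verified specifications might be organized, here is a minimal Python sketch. It is not the paper's implementation: the `CurriculumDAG` class, its method names, the `verify` callback, and the exploration constant `c` are all hypothetical, and the standard UCB1 score is assumed because the summary does not say which UCB variant is used.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Node:
    """One specification in the curriculum (hypothetical structure)."""
    spec: str
    parents: Tuple[int, ...] = ()
    visits: int = 0
    total_reward: float = 0.0


class CurriculumDAG:
    """Illustrative curriculum that only admits verified specifications."""

    def __init__(self, root_spec: str, c: float = 1.4):
        self.nodes: List[Node] = [Node(root_spec)]
        self.c = c  # exploration constant (assumed value)

    def select(self) -> int:
        """Pick the node to expand next using the standard UCB1 score."""
        # Unvisited nodes are tried first so every score is well defined.
        for i, node in enumerate(self.nodes):
            if node.visits == 0:
                return i
        total = sum(node.visits for node in self.nodes)

        def ucb(i: int) -> float:
            node = self.nodes[i]
            mean_reward = node.total_reward / node.visits
            bonus = self.c * math.sqrt(math.log(total) / node.visits)
            return mean_reward + bonus

        return max(range(len(self.nodes)), key=ucb)

    def update(self, idx: int, reward: float) -> None:
        """Record the reward observed after working from node idx."""
        self.nodes[idx].visits += 1
        self.nodes[idx].total_reward += reward

    def try_add(self, parents: Tuple[int, ...], spec: str,
                verify: Callable[[str], bool]) -> Optional[int]:
        """Add a proposed spec only if the verifier accepts it."""
        if not verify(spec):
            return None  # rejected proposals never enter the curriculum
        self.nodes.append(Node(spec, parents))
        return len(self.nodes) - 1
```

In this sketch a Proposer step would call `select()` to choose a parent specification, generate a new specification from it, and call `try_add()`; the Solver's verified outcome would then be fed back through `update()`.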
Key facts
- ANCORA is an anchored-curriculum framework for verifiable reasoning.
- It alternates between a Proposer and a Solver.
- It uses a two-level group-relative update that couples advantages.
- It employs iterative self-distilled SFT and a UCB-guided Curriculum DAG.
- These stabilizers are designed to prevent Proposer collapse under sparse verifier feedback.
- It operates without human supervision.
- The paper is available on arXiv as arXiv:2604.27644.
- It reframes the training objective from learning to answer to learning to question.
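To make the term "group-relative" concrete, the following Python sketch shows a single-level group-relative advantage: each sampled output is scored against the mean and standard deviation of its own group. This is the commonly used form of the idea and is an assumption here; how ANCORA's two levels are defined and coupled is not specified in this summary, and the function name and `eps` parameter are hypothetical.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float],
                              eps: float = 1e-8) -> List[float]:
    """Normalize each reward against its own group's mean and spread."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    # eps avoids division by zero when every reward in the group is equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group with mixed verifier outcomes yields a usable learning signal.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group where the verifier rejects everything yields all-zero advantages.
all_failed = group_relative_advantages([0.0, 0.0, 0.0])
```

The second case illustrates why sparse verifier feedback is a problem for this kind of update: when every sample in a group receives the same reward, the advantages vanish and the update carries no signal.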
Entities
Institutions
- arXiv