CPPO: Coordinated Pass@K Policy Optimization for Code Reasoning
Coordinated Pass@K Policy Optimization (CPPO) is a technique for improving code generation by coordinating the K sampling attempts that pass@K allows. Standard pass@K draws K independent samples from a single answer distribution, and those samples often collapse onto near-duplicate reasoning paths. CPPO instead uses a planner that emits a tuple of K=4 distinct high-level strategies, and a shared solver attempts one solution per strategy. The planner is optimized with a multiplicative reward R_plan = J_psi * R_out, which credits only valid strategy tuples that lead to a correct solution. The method targets competitive programming, where many problems admit several distinct algorithmic approaches.
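The multiplicative reward can be illustrated with a minimal sketch. The summary does not define J_psi or R_out precisely, so the code below rests on stated assumptions: J_psi is treated as a binary validity score for the strategy tuple (here a simple check that the tuple has K non-empty, pairwise-distinct entries), and R_out is treated as 1 when at least one of the K solver attempts is correct. The function names are hypothetical and are not taken from the paper.

```python
from typing import Sequence

K = 4  # tuple size stated in the text


def tuple_validity(strategies: Sequence[str], k: int = K) -> float:
    """Hypothetical stand-in for J_psi (an assumption, not the paper's judge).

    Returns 1.0 if the tuple has exactly k non-empty, pairwise-distinct
    strategies after trimming and lowercasing, else 0.0.
    """
    cleaned = [s.strip().lower() for s in strategies]
    if len(cleaned) != k or any(not s for s in cleaned):
        return 0.0
    return 1.0 if len(set(cleaned)) == k else 0.0


def outcome_reward(passed: Sequence[bool]) -> float:
    """Assumed R_out: 1.0 if at least one solver attempt is correct."""
    return 1.0 if any(passed) else 0.0


def planner_reward(strategies: Sequence[str], passed: Sequence[bool]) -> float:
    """R_plan = J_psi * R_out: nonzero only when both factors are nonzero."""
    return tuple_validity(strategies) * outcome_reward(passed)


plan = ["dynamic programming", "greedy", "binary search", "graph bfs"]
reward_valid_and_solved = planner_reward(plan, [False, True, False, False])
reward_duplicate_plan = planner_reward(["dp", "dp", "greedy", "bfs"], [True] * 4)
reward_valid_unsolved = planner_reward(plan, [False] * 4)
```

Because the two factors are multiplied, a tuple with duplicated strategies earns nothing even when every attempt passes, and a valid tuple earns nothing when no attempt passes. Under these assumptions, that is the sense in which credit goes only to valid tuples that lead to a correct solution.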
Key facts
- CPPO turns pass@K generation into joint exploration over strategies.
- Standard pass@K draws K independent samples from a single answer distribution.
- The standard approach often collapses onto near-duplicate reasoning paths.
- CPPO uses a planner to emit a tuple of K=4 alternative high-level methods.
- A shared solver attempts one solution per method.
- Planner reward is multiplicative: R_plan = J_psi * R_out.
- Credit is assigned only to valid strategy tuples that lead to a correct solution.
- CPPO is designed for competitive programming.
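The planner-then-solver flow listed above can be sketched as a small control loop. This is only an illustration under assumptions: the planner, solver, and checker are passed in as plain callables, and the toy stand-ins below are hypothetical placeholders for the models and test harness a real system would use.

```python
from typing import Callable, List, Tuple


def coordinated_rollout(
    problem: str,
    planner: Callable[[str], List[str]],
    solver: Callable[[str, str], str],
    checker: Callable[[str, str], bool],
) -> Tuple[List[str], List[str], List[bool]]:
    """One coordinated pass@K attempt.

    The planner proposes a tuple of strategies, the same (shared) solver
    writes one solution per strategy, and the checker scores each solution.
    """
    strategies = planner(problem)
    solutions = [solver(problem, s) for s in strategies]
    passed = [checker(problem, sol) for sol in solutions]
    return strategies, solutions, passed


# Toy stand-ins (hypothetical): a real setup would call models and run tests.
def toy_planner(problem: str) -> List[str]:
    return ["dynamic programming", "greedy", "binary search", "brute force"]


def toy_solver(problem: str, strategy: str) -> str:
    return f"[{strategy}] solution to {problem}"


def toy_checker(problem: str, solution: str) -> bool:
    # Pretend only the dynamic-programming attempt passes the tests.
    return solution.startswith("[dynamic programming]")


strategies, solutions, passed = coordinated_rollout(
    "toy problem", toy_planner, toy_solver, toy_checker
)
solved_at_k = any(passed)  # the problem counts as solved if any attempt passes
```

The design point is that diversity comes from the planner's tuple, not from sampling noise: each solver call is conditioned on a different strategy, so the K attempts are steered away from near-duplicate reasoning paths.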
Entities
Institutions
- arXiv