CPPO: Coordinated Pass@K Policy Optimization for Code Reasoning

other · 2026-05-27

Coordinated Pass@K Policy Optimization (CPPO) improves code generation by coordinating the K sampling attempts instead of drawing them independently. Standard pass@K samples K times from a single distribution, which often produces near-duplicate reasoning paths. CPPO instead uses a planner that emits a tuple of K=4 distinct high-level strategies, and a shared solver attempts one solution per strategy. The planner is trained with a multiplicative reward, R_plan = J_psi * R_out, so credit goes only to valid strategy tuples that lead to a correct solution. The method targets competitive programming, where many problems admit several distinct algorithmic approaches.
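The flow described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the summary does not define J_psi or R_out precisely, so here J_psi is assumed to be a 0/1 judgment that the strategy tuple is valid (modeled as "K pairwise-distinct strategies") and R_out is assumed to be 1 when at least one solver attempt is correct. All function names (`coordinated_rollout`, `toy_judge`, and so on) are hypothetical.

```python
from typing import Callable, List, Sequence

K = 4  # number of strategies per tuple, as stated in the text


def planner_reward(j_psi: float, r_out: float) -> float:
    """R_plan = J_psi * R_out: zero unless both factors are non-zero."""
    return j_psi * r_out


def coordinated_rollout(
    problem: str,
    plan: Callable[[str, int], Sequence[str]],
    solve: Callable[[str, str], str],
    judge_tuple: Callable[[Sequence[str]], float],
    is_correct: Callable[[str, str], bool],
) -> float:
    """Planner emits K strategies, the shared solver attempts one solution each."""
    strategies = plan(problem, K)
    solutions: List[str] = [solve(problem, s) for s in strategies]
    # Assumption: the outcome reward is pass@K-style (any attempt correct).
    r_out = 1.0 if any(is_correct(problem, sol) for sol in solutions) else 0.0
    return planner_reward(judge_tuple(strategies), r_out)


# Toy stand-ins (hypothetical) so the sketch runs end to end.
def toy_judge(strategies: Sequence[str]) -> float:
    # Assumption: "valid" means exactly K pairwise-distinct strategies.
    return 1.0 if len(strategies) == K and len(set(strategies)) == K else 0.0


def toy_solve(problem: str, strategy: str) -> str:
    return f"{strategy} solution"


def toy_is_correct(problem: str, solution: str) -> bool:
    return solution.startswith("dp")


def diverse_plan(problem: str, k: int) -> Sequence[str]:
    return ["brute force", "dp", "greedy", "binary search"][:k]


def collapsed_plan(problem: str, k: int) -> Sequence[str]:
    return ["dp"] * k
```

With these toy stand-ins, the diverse plan earns a planner reward of 1.0, while the collapsed plan earns 0.0 even though its solver output is correct, because the tuple fails the assumed validity check. That is the effect of the multiplicative form: a correct outcome alone is not enough to credit the planner.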

Key facts

CPPO turns pass@K generation into joint exploration over strategies.
Standard pass@K draws K independent samples from a single answer distribution.
The standard approach often collapses onto near-duplicate reasoning paths.
CPPO uses a planner to emit a tuple of K=4 alternative high-level methods.
A shared solver attempts one solution per method.
Planner reward is multiplicative: R_plan = J_psi * R_out.
Credit is assigned only to valid strategy tuples that lead to a correct solution.
CPPO is designed for competitive programming.
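To see why near-duplicate samples matter, a small worked calculation helps. If each of K samples succeeds independently with probability p, the chance that at least one succeeds is 1 - (1 - p)^K. At the idealized other extreme, where all K samples follow the same reasoning path and succeed or fail together, the chance stays at p. The sketch below computes both bounds; the value p = 0.2 is purely illustrative and does not come from the source.

```python
def pass_at_k_independent(p: float, k: int) -> float:
    """Probability that at least one of k independent samples succeeds."""
    return 1.0 - (1.0 - p) ** k


def pass_at_k_collapsed(p: float, k: int) -> float:
    """Idealized extreme: all k samples are duplicates, so extra samples add nothing."""
    return p


p, k = 0.2, 4  # illustrative values only
independent = pass_at_k_independent(p, k)
collapsed = pass_at_k_collapsed(p, k)
print(f"independent: {independent:.4f}, fully collapsed: {collapsed:.4f}")
```

Real samplers sit somewhere between these two values, which is the gap that coordinating the K attempts is meant to close.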
