ARTFEED — Contemporary Art Intelligence

CPPO: Coordinated Pass@K Policy Optimization for Code Reasoning

other · 2026-05-27

Coordinated Pass@K Policy Optimization (CPPO) is a technique for improving code generation by coordinating the K sampling attempts instead of drawing them independently. Standard pass@K draws K independent samples from a single answer distribution, and those samples often collapse onto near-duplicate reasoning paths. CPPO instead uses a planner that emits a tuple of K=4 distinct high-level strategies, and a shared solver then attempts one solution per strategy. The planner is optimized with a multiplicative reward, R_plan = J_psi * R_out, which assigns credit only to valid strategy tuples that lead to a correct solution. The method targets competitive programming, where many problems admit several distinct algorithmic approaches.
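The multiplicative planner reward can be illustrated with a minimal Python sketch. The summary does not define J_psi or R_out precisely, so the sketch makes two assumptions: J_psi is treated as a binary validity check on the strategy tuple (exactly K non-empty, pairwise distinct strategies), and R_out is treated as 1 when at least one of the K solver attempts is correct. The names is_valid_tuple, planner_reward, solve, and is_correct are hypothetical and are not taken from the CPPO work.

```python
from typing import Callable, Sequence

K = 4  # tuple size reported in the summary


def is_valid_tuple(strategies: Sequence[str], k: int = K) -> bool:
    """Assumed stand-in for J_psi: k non-empty, pairwise distinct strategies."""
    cleaned = [s.strip().lower() for s in strategies]
    return len(cleaned) == k and all(cleaned) and len(set(cleaned)) == k


def planner_reward(
    strategies: Sequence[str],
    solve: Callable[[str], str],
    is_correct: Callable[[str], bool],
) -> float:
    """Sketch of R_plan = J_psi * R_out under the assumptions stated above."""
    # J_psi: 1.0 for a valid tuple, 0.0 otherwise (assumed to be binary).
    j_psi = 1.0 if is_valid_tuple(strategies) else 0.0
    if j_psi == 0.0:
        # An invalid tuple earns nothing, so the solver is not run at all.
        return 0.0
    # R_out: 1.0 if any of the per-strategy attempts is correct (assumed).
    r_out = 1.0 if any(is_correct(solve(s)) for s in strategies) else 0.0
    return j_psi * r_out


if __name__ == "__main__":
    # Toy solver and checker, purely illustrative.
    plan = ["greedy", "dynamic programming", "binary search", "bfs"]
    reward = planner_reward(plan, lambda s: s, lambda c: c == "bfs")
    print(reward)
```

Because the two factors are multiplied, a tuple of duplicated strategies earns zero even if the solver happens to succeed, and a valid tuple earns zero if no attempt is correct.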

Key facts

  • CPPO turns pass@K generation into joint exploration over strategies.
  • Standard pass@K draws K independent samples from a single answer distribution.
  • The standard approach often collapses onto near-duplicate reasoning paths.
  • CPPO uses a planner to emit a tuple of K=4 alternative high-level methods.
  • A shared solver attempts one solution per method.
  • Planner reward is multiplicative: R_plan = J_psi * R_out.
  • Credit is assigned only to valid strategy tuples that lead to a correct solution.
  • CPPO is designed for competitive programming.

Entities

Institutions

  • arXiv
