ARTFEED — Contemporary Art Intelligence

CPPO: First On-Policy Contrastive RL Algorithm for Discrete and Continuous Actions

other · 2026-05-14

A team of researchers has unveiled Contrastive Proximal Policy Optimisation (CPPO), which they present as the first on-policy contrastive reinforcement learning algorithm. CPPO derives policy advantages directly from contrastive Q-values and optimizes the policy with the standard PPO objective, removing the need for a reward function or a replay buffer. Existing contrastive RL techniques are off-policy and largely restricted to continuous action spaces; CPPO extends contrastive RL to on-policy training, supporting both single-agent and multi-agent reinforcement learning in continuous and discrete action settings. The research was published on arXiv under the identifier 2605.13554.
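
To make the "advantages from contrastive Q-values" idea concrete, here is a minimal PyTorch sketch. It is not the authors' implementation: the encoders, shapes, and the batch-centered baseline are all illustrative assumptions.

    import torch
    import torch.nn as nn

    # Illustrative dimensions; every name and shape here is an assumption.
    STATE_DIM, ACTION_DIM, EMB_DIM, BATCH = 8, 2, 32, 16

    # A contrastive critic embeds (state, action) pairs and reached future
    # states; the compatibility score of a pair with its own future plays
    # the role of a Q-value, so no hand-designed reward function is needed.
    sa_encoder = nn.Linear(STATE_DIM + ACTION_DIM, EMB_DIM)
    goal_encoder = nn.Linear(STATE_DIM, EMB_DIM)

    def contrastive_q(states, actions, futures):
        sa = sa_encoder(torch.cat([states, actions], dim=-1))  # (B, EMB_DIM)
        g = goal_encoder(futures)                              # (B, EMB_DIM)
        return (sa * g).sum(dim=-1)                            # (B,) scores

    states = torch.randn(BATCH, STATE_DIM)
    actions = torch.randn(BATCH, ACTION_DIM)
    futures = torch.randn(BATCH, STATE_DIM)  # states actually reached later

    q_values = contrastive_q(states, actions, futures)
    advantages = q_values - q_values.mean()  # batch-centered baseline (assumption)

Because the policy's own rollouts supply both the (state, action) pairs and the futures they lead to, nothing in this sketch requires a replay buffer.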

Key facts

  • CPPO is an on-policy contrastive RL algorithm.
  • It derives policy advantages from contrastive Q-values.
  • Optimization uses the standard PPO objective (see the sketch after this list).
  • No reward function or replay buffer is required.
  • Existing contrastive RL (CRL) algorithms are off-policy and mostly limited to continuous actions.
  • CPPO works in both continuous and discrete action spaces.
  • It supports single-agent and multi-agent RL.
  • The paper is on arXiv: 2605.13554.
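
For reference, the standard PPO clipped surrogate that such advantages would be plugged into looks as follows (a textbook formulation with hypothetical variable names, not code from the paper):

    import torch

    def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
        # logp_new / logp_old: log-probabilities of the sampled actions under
        # the current and data-collecting policies; advantages could be the
        # batch-centered contrastive Q-values from the sketch above.
        ratio = torch.exp(logp_new - logp_old)  # importance ratio
        clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
        # The clipped surrogate is maximized; return its negation as a loss.
        return -torch.min(ratio * advantages, clipped * advantages).mean()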

Entities

Institutions

  • arXiv
