CPPO: First On-Policy Contrastive RL Algorithm for Discrete and Continuous Actions
Researchers have introduced Contrastive Proximal Policy Optimization (CPPO), the first on-policy contrastive reinforcement learning (RL) algorithm. CPPO derives policy advantages directly from contrastive Q-values and optimizes them with the standard PPO objective, removing the need for a reward function or replay buffer. Existing contrastive RL methods are off-policy and mostly limited to continuous action spaces; CPPO extends contrastive RL to on-policy training and supports both single-agent and multi-agent RL in continuous and discrete action spaces. The paper is available on arXiv under identifier 2605.13554.
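To make the mechanism concrete, here is a minimal sketch of a PPO-style update in which the advantage comes from a contrastive Q-estimate rather than environment rewards. This is an illustration under stated assumptions, not the authors' implementation: `policy` (with `sample` and `log_prob` methods), `contrastive_q`, and the choice of baseline are all hypothetical names introduced here.

```python
import torch

def cppo_style_loss(policy, old_log_probs, states, actions, goals,
                    contrastive_q, clip_eps=0.2):
    # Contrastive Q-values for the taken actions; the critic is assumed
    # to score (state, action, goal) compatibility (e.g. InfoNCE logits).
    q = contrastive_q(states, actions, goals)                  # [B]

    # Illustrative baseline: contrastive Q under actions sampled from
    # the current policy, so the advantage is q minus this baseline.
    with torch.no_grad():
        baseline = contrastive_q(states, policy.sample(states), goals)
    adv = (q - baseline).detach()
    adv = (adv - adv.mean()) / (adv.std() + 1e-8)              # normalize

    # Standard PPO clipped surrogate, applied to contrastive advantages.
    log_probs = policy.log_prob(states, actions)               # [B]
    ratio = torch.exp(log_probs - old_log_probs)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return -torch.min(ratio * adv, clipped * adv).mean()
```

Because the surrogate only needs advantages for actions drawn from the current policy, this kind of update is compatible with on-policy rollouts and needs no replay buffer.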
Key facts
- CPPO is an on-policy contrastive RL algorithm.
- It derives policy advantages from contrastive Q-values.
- Optimization uses the standard PPO objective.
- No reward function or replay buffer is required (see the critic sketch after this list).
- Existing contrastive RL (CRL) algorithms are off-policy and mostly limited to continuous actions.
- CPPO works in both continuous and discrete action spaces.
- It supports single-agent and multi-agent RL.
- The paper is on arXiv: 2605.13554.
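For context on how Q-values can be learned without a reward function: contrastive RL critics are commonly trained with an InfoNCE-style objective that classifies the true future state of a state-action pair against in-batch negatives. Below is a minimal sketch of that standard recipe; `sa_encoder` and `goal_encoder` are assumed modules introduced here for illustration and are not taken from the paper.

```python
import torch
import torch.nn.functional as F

def contrastive_critic_loss(sa_encoder, goal_encoder,
                            states, actions, future_states):
    # Embed (state, action) pairs and candidate future states.
    sa = sa_encoder(states, actions)      # [B, D]
    g = goal_encoder(future_states)       # [B, D]
    # Pairwise scores: entry (i, j) scores pair i against future j.
    logits = sa @ g.T                     # [B, B]
    # Each pair's own future state is the positive; the rest of the
    # batch provides negatives, so no reward labels are needed.
    labels = torch.arange(len(states), device=logits.device)
    return F.cross_entropy(logits, labels)
```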