CPPO: First On-Policy Contrastive RL Algorithm for Discrete and Continuous Actions
Researchers have introduced Contrastive Proximal Policy Optimization (CPPO), the first on-policy contrastive reinforcement learning (RL) algorithm. CPPO derives policy advantages directly from contrastive Q-values and optimizes them with the standard PPO objective, removing the need for a reward function or replay buffer. Existing contrastive RL methods are off-policy and mostly limited to continuous action spaces; CPPO extends contrastive RL to on-policy training and supports both single-agent and multi-agent RL in continuous and discrete action spaces. The paper is available on arXiv under identifier 2605.13554.
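To make the mechanism concrete, here is a minimal sketch of a PPO-style update in which the advantage comes from a contrastive Q-estimate rather than environment rewards. This is an illustration under stated assumptions, not the authors' implementation: `policy` (with `sample` and `log_prob` methods), `contrastive_q`, and the choice of baseline are all hypothetical names introduced here.

```python
import torch

def cppo_style_loss(policy, old_log_probs, states, actions, goals,
                    contrastive_q, clip_eps=0.2):
    # Contrastive Q-values for the taken actions; the critic is assumed
    # to score (state, action, goal) compatibility (e.g. InfoNCE logits).
    q = contrastive_q(states, actions, goals)                  # [B]

    # Illustrative baseline: contrastive Q under actions sampled from
    # the current policy, so the advantage is q minus this baseline.
    with torch.no_grad():
        baseline = contrastive_q(states, policy.sample(states), goals)
    adv = (q - baseline).detach()
    adv = (adv - adv.mean()) / (adv.std() + 1e-8)              # normalize

    # Standard PPO clipped surrogate, applied to contrastive advantages.
    log_probs = policy.log_prob(states, actions)               # [B]
    ratio = torch.exp(log_probs - old_log_probs)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return -torch.min(ratio * adv, clipped * adv).mean()
```

Because the surrogate only needs advantages for actions drawn from the current policy, this kind of update is compatible with on-policy rollouts and needs no replay buffer.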
Key facts
- CPPO is an on-policy contrastive RL algorithm.
- It derives policy advantages from contrastive Q-values.
- Optimization uses the standard PPO objective.
- No reward function or replay buffer is required (see the critic sketch after this list).
- Existing contrastive RL (CRL) algorithms are off-policy and mostly limited to continuous actions.
- CPPO works in both continuous and discrete action spaces.
- It supports single-agent and multi-agent RL.
- The paper is on arXiv: 2605.13554.
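For context on how Q-values can be learned without a reward function: contrastive RL critics are commonly trained with an InfoNCE-style objective that classifies the true future state of a state-action pair against in-batch negatives. Below is a minimal sketch of that standard recipe; `sa_encoder` and `goal_encoder` are assumed modules introduced here for illustration and are not taken from the paper.

```python
import torch
import torch.nn.functional as F

def contrastive_critic_loss(sa_encoder, goal_encoder,
                            states, actions, future_states):
    # Embed (state, action) pairs and candidate future states.
    sa = sa_encoder(states, actions)      # [B, D]
    g = goal_encoder(future_states)       # [B, D]
    # Pairwise scores: entry (i, j) scores pair i against future j.
    logits = sa @ g.T                     # [B, B]
    # Each pair's own future state is the positive; the rest of the
    # batch provides negatives, so no reward labels are needed.
    labels = torch.arange(len(states), device=logits.device)
    return F.cross_entropy(logits, labels)
```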