COSE: Confidence-Weighted PPO for Self-Evolving LLMs

ai-technology · 2026-05-28

Researchers have introduced COSE (Confidence-Orchestrated Self-Evolution), a method for self-evolving large language models (LLMs). It uses the model's intrinsic confidence as an uncertainty signal to guide learning. COSE combines confidence-weighted PPO updates with confidence-prioritized replay to address the training-signal problem in self-evolution, where erroneous self-judgments produce flawed gradient updates. It requires no external verifier and does not treat noisy self-generated feedback as supervision. Evaluated on 19 held-out benchmarks across four Qwen/Llama backbones (0.6B to 4B parameters), COSE consistently outperforms the base models and achieves the highest average performance in general reasoning and mathematics. The paper is available on arXiv under ID 2605.28010.
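To make the idea of a confidence-weighted PPO update concrete, here is a minimal sketch in plain Python. It is not taken from the paper: the function name `confidence_weighted_ppo_loss`, the assumption that confidence is a per-sample weight in [0, 1], and the normalization by the sum of weights are all illustrative choices. It shows only the general pattern of scaling the standard clipped PPO surrogate so that low-confidence samples contribute less to the update.

```python
import math

def confidence_weighted_ppo_loss(new_logps, old_logps, advantages,
                                 confidences, clip_eps=0.2):
    """Clipped PPO surrogate with a per-sample confidence weight.

    Hypothetical sketch: `confidences` are assumed to lie in [0, 1];
    the exact weighting used by COSE is not given in this summary.
    """
    total = 0.0
    weight_sum = 0.0
    for new_lp, old_lp, adv, conf in zip(new_logps, old_logps,
                                         advantages, confidences):
        # Probability ratio between the updated and the old policy.
        ratio = math.exp(new_lp - old_lp)
        # Standard PPO clipping of the ratio to [1 - eps, 1 + eps].
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        surrogate = min(ratio * adv, clipped * adv)
        # Assumed weighting: scale each sample's term by its confidence.
        total += conf * surrogate
        weight_sum += conf
    if weight_sum == 0.0:
        return 0.0  # no confident samples, so no learning signal
    # Negate because optimizers minimize, while PPO maximizes the surrogate.
    return -total / weight_sum
```

With this weighting, a sample whose confidence is 0 drops out of the loss entirely, which is one simple way an erroneous but uncertain self-judgment could be kept from steering the gradient.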

Key facts

COSE uses the LLM's intrinsic confidence as an uncertainty signal
Introduces confidence-weighted PPO updates
Introduces confidence-prioritized replay
Evaluated on 19 held-out benchmarks
Tested on Qwen/Llama backbones from 0.6B to 4B parameters
Addresses the training-signal problem caused by erroneous self-judgments
Needs no external verifier and does not treat noisy self-generated feedback as supervision
Published on arXiv with ID 2605.28010
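
Confidence-prioritized replay, listed above, can be pictured with a small sketch. This one is illustrative only and makes several assumptions not confirmed by the source: the class name `ConfidenceReplayBuffer`, the `alpha` exponent, first-in-first-out eviction, and the choice to sample higher-confidence items more often are all hypothetical. The actual prioritization rule in COSE may differ.

```python
import random

class ConfidenceReplayBuffer:
    """Replay buffer that samples items in proportion to confidence**alpha.

    Hypothetical sketch: favoring high-confidence samples is an assumption.
    """

    def __init__(self, capacity, alpha=1.0, seed=None):
        self.capacity = capacity
        self.alpha = alpha          # 0 gives uniform sampling
        self.items = []             # list of (sample, confidence) pairs
        self.rng = random.Random(seed)

    def add(self, sample, confidence):
        # Evict the oldest entry once the buffer is full.
        if len(self.items) >= self.capacity:
            self.items.pop(0)
        self.items.append((sample, confidence))

    def sample(self, k):
        samples = [s for s, _ in self.items]
        weights = [max(c, 0.0) ** self.alpha for _, c in self.items]
        if sum(weights) == 0.0:
            # All priorities are zero: fall back to uniform sampling.
            return self.rng.choices(samples, k=k)
        # Sampling is with replacement, weighted by priority.
        return self.rng.choices(samples, weights=weights, k=k)
```

A usage example: after `buf.add("rollout_a", 0.9)` and `buf.add("rollout_b", 0.1)`, calling `buf.sample(4)` returns four items drawn with replacement, with `"rollout_a"` roughly nine times as likely per draw as `"rollout_b"` when `alpha` is 1.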
