ARTFEED — Contemporary Art Intelligence

COSE: Confidence-Weighted PPO for Self-Evolving LLMs

ai-technology · 2026-05-28

Researchers have introduced COSE (Confidence-Orchestrated Self-Evolution), a method for self-evolving large language models (LLMs). It uses the model's intrinsic confidence as an uncertainty signal to guide learning. COSE combines confidence-weighted PPO updates with confidence-prioritized replay to address a training-signal problem: when a model judges its own outputs incorrectly, those erroneous self-judgments produce flawed gradient updates. The method does not rely on external verifiers and does not treat noisy self-generated feedback as supervision. Evaluated on 19 held-out benchmarks across four Qwen/Llama backbones (0.6B to 4B parameters), COSE consistently outperforms the base models and achieves the highest average performance in general reasoning and mathematics. The paper is available on arXiv under ID 2605.28010.
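
To make the idea of a confidence-weighted PPO update concrete, here is a minimal sketch in plain Python. It is not the paper's formulation, which this summary does not give. It assumes each sample carries a confidence in [0, 1] and that this confidence simply scales the standard clipped PPO surrogate before averaging. The function name, the normalization by total confidence, and the default `clip_eps` are all illustrative assumptions.

```python
import math


def confidence_weighted_ppo_loss(logp_new, logp_old, advantages, confidences,
                                 clip_eps=0.2):
    """Clipped PPO surrogate loss with a per-sample confidence weight.

    Hypothetical sketch: the weighting scheme is an assumption, not the
    formulation used by COSE.
    """
    if not (len(logp_new) == len(logp_old) == len(advantages) == len(confidences)):
        raise ValueError("all inputs must have the same length")

    total_weight = sum(confidences)
    if total_weight == 0:
        # No sample is trusted at all, so contribute no training signal.
        return 0.0

    loss = 0.0
    for lp_new, lp_old, adv, conf in zip(logp_new, logp_old, advantages, confidences):
        # Probability ratio between the updated policy and the old policy.
        ratio = math.exp(lp_new - lp_old)
        clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Standard PPO pessimistic (clipped) surrogate objective.
        surrogate = min(ratio * adv, clipped_ratio * adv)
        # Assumption: low-confidence samples contribute proportionally less.
        loss -= conf * surrogate

    # Normalize by total confidence so the loss scale stays comparable.
    return loss / total_weight
```

Under this assumption, a sample with confidence 0 has no influence on the update, while equal confidences reduce the expression to the ordinary mean PPO loss.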

Key facts

  • COSE uses the LLM's intrinsic confidence as an uncertainty signal
  • Introduces confidence-weighted PPO updates
  • Introduces confidence-prioritized replay
  • Evaluated on 19 held-out benchmarks
  • Tested on Qwen/Llama backbones from 0.6B to 4B parameters
  • Addresses training-signal challenge from erroneous self-judgments
  • Avoids external verifiers and noisy self-generated feedback
  • Published on arXiv with ID 2605.28010
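
The confidence-prioritized replay listed above could look something like the following sketch. The summary does not say how COSE sets replay priorities, so this version assumes that samples are drawn with probability proportional to `confidence ** alpha`, favoring high-confidence samples; the opposite choice (replaying uncertain samples more often) would be equally easy to express. The class name, the `alpha` exponent, and the first-in-first-out eviction policy are hypothetical.

```python
import random
from collections import deque


class ConfidenceReplayBuffer:
    """Replay buffer that samples in proportion to stored confidence.

    Hypothetical sketch: the priority rule is an assumption, not COSE's.
    """

    def __init__(self, capacity, alpha=1.0, seed=None):
        # deque with maxlen evicts the oldest entry once capacity is reached.
        self.items = deque(maxlen=capacity)
        self.alpha = alpha
        self.rng = random.Random(seed)

    def add(self, sample, confidence):
        if not 0.0 <= confidence <= 1.0:
            raise ValueError("confidence must lie in [0, 1]")
        self.items.append((sample, confidence))

    def priorities(self):
        """Return normalized sampling probabilities, oldest entry first."""
        if not self.items:
            return []
        weights = [conf ** self.alpha for _, conf in self.items]
        total = sum(weights)
        if total == 0:
            # All confidences are zero: fall back to uniform sampling.
            return [1.0 / len(weights)] * len(weights)
        return [w / total for w in weights]

    def sample(self, k):
        """Draw k samples with replacement according to the priorities."""
        if not self.items:
            raise ValueError("cannot sample from an empty buffer")
        population = [sample for sample, _ in self.items]
        return self.rng.choices(population, weights=self.priorities(), k=k)


# Usage: the oldest entry ("a") is evicted once capacity is exceeded.
buffer = ConfidenceReplayBuffer(capacity=2, seed=0)
buffer.add("a", 0.9)
buffer.add("b", 0.1)
buffer.add("c", 0.3)
batch = buffer.sample(4)
```

Setting `alpha` to 0 recovers uniform replay, while larger values concentrate sampling on the most confident entries.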

Entities

Institutions

  • arXiv

Sources