EVOLM: Self-Improving Language Models via Co-Evolved Rubrics
A new method called EVOLM enables language models to self-improve by generating their own evaluation criteria. Current post-training methods rely on external supervision (human annotations, proprietary APIs, or scalar reward models), each with inherent limitations: human judgment cannot supervise capabilities beyond its own, proprietary APIs create external dependencies, and verifiable rewards apply only to domains with ground-truth answers. EVOLM instead structures a model's own evaluative capacity into explicit discriminative rubrics that serve as training signals. The method alternates between two phases: training a rubric generator to produce instance-specific criteria optimized for discriminative utility, and using those rubrics as reward signals to improve the model itself. Because the rubrics and the model improve together, self-improvement scales with the model's own capability rather than being capped by external supervision. The paper is available on arXiv as 2605.03871.
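The description above implies a simple alternating loop. The sketch below is a minimal illustration of that loop, assuming a setup in which a rubric's discriminative utility is measured by how widely it spreads the scores of sampled candidate responses; every class, function, and objective here is a hypothetical stand-in (scoring is stubbed with random numbers), not the paper's actual implementation.

```python
import random

# Minimal sketch of an EVOLM-style alternating loop. All names are
# hypothetical illustrations; a real system would use language models
# and gradient updates where stubs appear below.

class RubricGenerator:
    """Proposes instance-specific evaluation criteria for a prompt."""
    def propose(self, prompt: str) -> list[str]:
        # A real generator would be a language model conditioned on the prompt.
        return [f"criterion {i} for: {prompt[:20]}" for i in range(3)]

    def update(self, reward: float) -> None:
        # Placeholder for an optimization step rewarding discriminative rubrics.
        pass

class PolicyModel:
    """The model being improved; samples candidate responses."""
    def sample(self, prompt: str, n: int = 4) -> list[str]:
        return [f"response {i} to: {prompt[:20]}" for i in range(n)]

    def update(self, prompt: str, best: str) -> None:
        # Placeholder for fine-tuning toward the rubric-preferred response.
        pass

def score(response: str, rubric: list[str]) -> float:
    """Grade a response against each criterion; stubbed with random values."""
    return sum(random.random() for _ in rubric) / len(rubric)

def discriminative_utility(scores: list[float]) -> float:
    """Assumed proxy objective: a rubric is useful insofar as it
    separates better candidates from worse ones (score spread)."""
    return max(scores) - min(scores)

def evolm_step(generator: RubricGenerator, policy: PolicyModel, prompt: str) -> None:
    # Phase 1: generate an instance-specific rubric and score candidates.
    rubric = generator.propose(prompt)
    candidates = policy.sample(prompt)
    scores = [score(c, rubric) for c in candidates]

    # Phase 2a: reward the generator for how well its rubric discriminates.
    generator.update(discriminative_utility(scores))

    # Phase 2b: push the policy toward the top-scoring candidate.
    best = candidates[scores.index(max(scores))]
    policy.update(prompt, best)

if __name__ == "__main__":
    gen, pol = RubricGenerator(), PolicyModel()
    for _ in range(3):
        evolm_step(gen, pol, "Explain why the sky is blue.")
```

In a real system both `update` calls would be gradient steps on language models, and the spread-based utility here would be replaced by whatever discriminative objective the paper actually optimizes.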
Key facts
- EVOLM is a post-training method for language models.
- It uses self-generated discriminative rubrics as training signals.
- Current methods rely on human annotations, proprietary APIs, or scalar reward models.
- Human judgment cannot supervise capabilities beyond its own.
- Proprietary APIs create dependencies.
- Verifiable rewards only cover domains with ground-truth answers.
- EVOLM trains a rubric generator and the model in alternation.
- The paper is on arXiv: 2605.03871.