ROPD: Rubric-Based On-Policy Distillation for LLM Alignment
Researchers have introduced ROPD, a rubric-based on-policy distillation (OPD) framework that replaces teacher logits with structured semantic rubrics to improve model alignment. ROPD induces prompt-specific rubrics from contrasts between teacher and student outputs, then scores student rollouts against those rubrics for on-policy optimization. This enables OPD in black-box settings where only the teacher's responses are accessible. ROPD outperforms advanced logit-based OPD methods in most scenarios, achieving up to a 10x gain in sample efficiency. The framework serves as a flexible, black-box-friendly alternative to logit-based OPD, establishing a simple yet strong baseline for scalable distillation of both proprietary and open-source LLMs. The code is available on arXiv.
Key facts
- ROPD is a rubric-based on-policy distillation framework.
- It uses structured semantic rubrics instead of teacher logits.
- Rubrics are induced from teacher-student contrasts.
- ROPD scores student rollouts for on-policy optimization.
- Outperforms advanced logit-based OPD methods in most scenarios.
- Achieves up to 10x gain in sample efficiency.
- Enables OPD in black-box scenarios.
- Code available on arXiv.
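The pipeline implied by the facts above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the rubric format, the keyword-based `induce_rubric` stub (which in ROPD would be an LLM-driven step), and the use of rubric scores as scalar rewards are all assumptions made for clarity.

```python
from dataclasses import dataclass

@dataclass
class RubricItem:
    criterion: str  # semantic property the teacher response exhibits
    weight: float   # importance of the criterion

def induce_rubric(teacher_response: str, student_response: str) -> list[RubricItem]:
    """Derive prompt-specific criteria from a teacher-student contrast.

    Hypothetical stub: treats tokens present in the teacher output but
    missing from the student output as rubric criteria. In ROPD this
    induction would be performed by a model, not a keyword diff.
    """
    teacher_tokens = set(teacher_response.lower().split())
    student_tokens = set(student_response.lower().split())
    return [RubricItem(criterion=t, weight=1.0)
            for t in sorted(teacher_tokens - student_tokens)]

def score_rollout(rollout: str, rubric: list[RubricItem]) -> float:
    """Weighted fraction of rubric criteria a student rollout satisfies."""
    if not rubric:
        return 1.0
    total = sum(item.weight for item in rubric)
    hit = sum(item.weight for item in rubric
              if item.criterion in rollout.lower())
    return hit / total

# On-policy step: score fresh student rollouts against the induced rubric
# and use the scores as rewards (e.g. advantages for a policy-gradient
# update), with no access to teacher logits at any point.
teacher = "Paris is the capital of France"
student = "Paris is a city"
rubric = induce_rubric(teacher, student)
rollouts = ["Paris is the capital of France", "Paris is a city"]
rewards = [score_rollout(r, rubric) for r in rollouts]
```

Because scoring depends only on the teacher's text response, the same loop works when the teacher is a proprietary API exposing no logits, which is the black-box setting the framework targets.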
Entities
Institutions
- arXiv