METIS: Internalizing Curriculum Judgment for LLM Reinforcement Fine-Tuning
METIS (METacognitive Internalized Self-judgment) is a proposed framework for curriculum learning in LLM Reinforcement Fine-Tuning (RFT). Whereas existing techniques delegate curriculum evaluation to external heuristics or auxiliary models, METIS internalizes this judgment as a native capability of the policy itself. It builds on the observation that within-prompt reward variance is an effective measure of prompt informativeness, and it has the model predict this variance from its recent training outcomes, treated as in-context learning examples. This self-judgment then dynamically shapes the training distribution. METIS further couples judgment with optimization by jointly maximizing the standard RFT reward and a self-judgment reward, so the policy metacognitively learns what to learn next. The paper is available on arXiv under ID 2605.11235.
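The core signal described above — within-prompt reward variance as a proxy for informativeness — can be sketched in a few lines. This is an illustrative reconstruction, not the paper's implementation; the function name and the sampled reward lists are hypothetical.

```python
import statistics

def prompt_informativeness(rewards):
    """Within-prompt reward variance over several sampled completions.

    High variance means the policy sometimes succeeds and sometimes
    fails on this prompt, so the prompt likely still carries learning
    signal; zero variance (always solved or always failed) means the
    prompt is currently uninformative.
    """
    if len(rewards) < 2:
        return 0.0
    return statistics.pvariance(rewards)

# Hypothetical per-prompt reward samples (e.g. binary task rewards):
mixed = prompt_informativeness([1.0, 0.0, 1.0, 0.0])   # high variance
solved = prompt_informativeness([1.0, 1.0, 1.0, 1.0])  # zero variance
```

A curriculum built on this signal would preferentially sample prompts like `mixed` over prompts like `solved`.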
Key facts
- METIS internalizes curriculum judgment as a native capability for LLM RFT.
- Current methods externalize curriculum judgment via heuristics or auxiliary models.
- Within-prompt reward variance gauges prompt informativeness.
- METIS predicts this variance from recent training outcomes, used as in-context learning examples.
- Self-judgment dynamically dictates training allocation.
- METIS jointly optimizes standard RFT rewards and a self-judgment reward.
- The policy metacognitively learns what to learn next.
- Paper available at arXiv:2605.11235.
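The joint objective in the facts above can be sketched as a shaped reward: the standard RFT task reward plus a self-judgment term that pays the policy for accurately predicting the realized within-prompt variance. The combination rule, the absolute-error penalty, and the weight `beta` are all assumptions for illustration, not the paper's exact formulation.

```python
def joint_reward(task_reward, predicted_var, realized_var, beta=0.1):
    """Combine the standard RFT reward with a self-judgment reward.

    The self-judgment reward is the negative absolute error between the
    variance the policy predicted for this prompt and the variance
    actually realized across its sampled completions; `beta` trades off
    the two terms. (Hypothetical shaping; the paper may use a
    different penalty or weighting.)
    """
    judgment_reward = -abs(predicted_var - realized_var)
    return task_reward + beta * judgment_reward

# Perfect self-prediction leaves the task reward unchanged;
# a misprediction is penalized in proportion to its error.
exact = joint_reward(1.0, predicted_var=0.25, realized_var=0.25)
off = joint_reward(1.0, predicted_var=0.0, realized_var=0.25)
```

Under this shaping, the policy is rewarded both for solving tasks and for knowing which prompts it can and cannot yet solve.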