Distribution-Aware Reward for LLM Regression
A new reinforcement learning technique improves the predictive distributions that large language models produce for regression tasks. Called Distribution-Aware Reward, the method treats multiple decoded samples as an empirical predictive distribution and scores that distribution with the Continuous Ranked Probability Score (CRPS). Each rollout receives leave-one-out credit according to its marginal contribution to the quality of the distribution, which rewards predictions that are both accurate and well calibrated. The approach addresses a shortcoming of conventional training objectives, which optimize point estimates and therefore do not guarantee calibrated predictive distributions.
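To make the idea concrete, here is a minimal sketch of how a CRPS-based leave-one-out reward could be computed for one prompt. It is an illustration under stated assumptions, not the paper's implementation: the function names `empirical_crps` and `leave_one_out_rewards` are hypothetical, the CRPS uses the standard sample-based estimator (mean absolute error to the target minus half the mean pairwise absolute difference between samples), and the reward for a rollout is assumed to be the CRPS of the remaining samples minus the CRPS of the full set. The summary does not specify the exact estimator, sign convention, or normalization the authors use.

```python
def empirical_crps(samples, target):
    """Sample-based CRPS of an empirical distribution (lower is better)."""
    n = len(samples)
    # Accuracy term: average distance from each sample to the true value.
    accuracy = sum(abs(x - target) for x in samples) / n
    # Spread term: half the mean pairwise distance between samples.
    spread = sum(abs(a - b) for a in samples for b in samples) / (2 * n * n)
    return accuracy - spread


def leave_one_out_rewards(samples, target):
    """Assumed credit rule: how much the CRPS worsens when a rollout is removed.

    A positive reward means the rollout improved the distribution;
    a negative reward means the distribution scores better without it.
    """
    samples = list(samples)
    if len(samples) < 2:
        raise ValueError("leave-one-out credit needs at least two samples")
    full = empirical_crps(samples, target)
    return [
        empirical_crps(samples[:i] + samples[i + 1:], target) - full
        for i in range(len(samples))
    ]


# Four hypothetical decoded predictions for a target value of 2.0.
rollouts = [1.0, 2.0, 3.0, 10.0]
rewards = leave_one_out_rewards(rollouts, target=2.0)
```

In this example the outlier 10.0 receives a negative reward and the exact prediction 2.0 receives the largest one, so credit depends on how each rollout shapes the whole distribution and not only on its individual error.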
Key facts
- Introduces Distribution-Aware Reward for LLM regression
- Uses on-policy reinforcement learning
- Evaluates multiple decoded samples as an empirical predictive distribution
- Employs the Continuous Ranked Probability Score (CRPS)
- Assigns leave-one-out credit based on marginal contribution
- Aims to improve predictive distribution calibration
- Addresses limitations of point estimate optimization
- Published on arXiv with ID 2605.20740
Entities
Institutions
- arXiv