EAPO: Entropy-Driven Weighting for RL in Open-Ended QA

other · 2026-05-28

A new paper on arXiv (2605.27846) introduces EAPO, an Entropy-driven Adaptive Policy Optimization method for reinforcement learning in open-ended question answering. The authors systematically investigate the roles of positive and negative samples and find that negative samples govern response diversity and the performance upper bound, while positive samples determine response quality and convergence stability. EAPO adaptively computes the weighting coefficients for positive samples from the ratio of the current policy entropy to the initial policy entropy. The work targets a limitation of existing RLVR (reinforcement learning with verifiable rewards) approaches, which use fixed weights and fail to generalize to open-ended QA.
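To make the entropy-ratio idea concrete, here is a minimal Python sketch. It is not the paper's implementation: the summary above only says that the positive-sample weight depends on the ratio of current to initial policy entropy, so the specific mapping (the ratio clipped to [0, 1]), the REINFORCE-style loss, and all function names (`mean_policy_entropy`, `positive_weight`, `weighted_policy_loss`) are assumptions made for illustration.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a single probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_policy_entropy(token_distributions):
    """Average entropy over a batch of next-token distributions."""
    return sum(entropy(d) for d in token_distributions) / len(token_distributions)


def positive_weight(current_entropy, initial_entropy, eps=1e-8):
    """Hypothetical mapping from entropy ratio to a positive-sample weight.

    Assumption: the weight is the ratio itself, clipped to [0, 1].
    The paper may use a different function of this ratio.
    """
    ratio = current_entropy / max(initial_entropy, eps)
    return min(max(ratio, 0.0), 1.0)


def weighted_policy_loss(log_probs, advantages, pos_weight):
    """REINFORCE-style surrogate loss (an assumed form, not the paper's).

    Samples with positive advantage are scaled by pos_weight;
    samples with zero or negative advantage keep a weight of 1.0.
    """
    total = 0.0
    for lp, adv in zip(log_probs, advantages):
        w = pos_weight if adv > 0 else 1.0
        total += -w * adv * lp
    return total / len(log_probs)


# Toy usage: the policy starts uniform over 4 tokens and later
# concentrates on 2 of them, so its entropy halves.
initial = mean_policy_entropy([[0.25, 0.25, 0.25, 0.25]])
current = mean_policy_entropy([[0.5, 0.5, 0.0, 0.0]])
w_pos = positive_weight(current, initial)  # approximately 0.5
loss = weighted_policy_loss([-1.0, -2.0], [1.0, -1.0], w_pos)
```

In this toy setup the weight on positive samples shrinks as the policy's entropy falls below its starting value, in contrast to a fixed-weight scheme where `w_pos` would stay constant throughout training. Whether EAPO's weight moves in this direction is not stated in the summary.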

Key facts

arXiv paper 2605.27846
EAPO: Entropy-driven Adaptive Policy Optimization
Focuses on open-ended question answering
Negative samples govern response diversity and the performance upper bound
Positive samples determine quality and stability
Adaptive weighting based on entropy ratio
Addresses fixed-weight limitations in RLVR
Published on arXiv
