ARTFEED — Contemporary Art Intelligence

EAPO: Entropy-Driven Weighting for RL in Open-Ended QA

other · 2026-05-28

A new arXiv paper (2605.27846) introduces EAPO (Entropy-driven Adaptive Policy Optimization), a reinforcement learning method for open-ended question answering. The authors systematically study the roles of positive and negative samples and report that negative samples govern response diversity and the performance upper bound, while positive samples determine response quality and convergence stability. Rather than using a fixed weight, EAPO computes the weighting coefficient for positive samples adaptively from the ratio of the current policy entropy to the initial policy entropy. The work targets a limitation of existing RLVR (reinforcement learning with verifiable rewards) approaches, which use fixed weights and fail to generalize to open-ended QA.
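To make the entropy-ratio idea concrete, here is a minimal Python sketch. It is not the paper's formula, which this summary does not give. The function names (token_entropy, positive_weight, weighted_policy_loss), the clipping bounds, and the choice to make the positive-sample weight shrink as entropy falls are all illustrative assumptions.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def positive_weight(current_entropy, initial_entropy, w_min=0.0, w_max=1.0):
    """Hypothetical mapping from the entropy ratio to a positive-sample weight.

    Assumption: the weight equals current/initial entropy, clipped to
    [w_min, w_max]. The source only says the weight is "based on" this ratio.
    """
    if initial_entropy <= 0.0:
        raise ValueError("initial_entropy must be positive")
    ratio = current_entropy / initial_entropy
    return min(w_max, max(w_min, ratio))


def weighted_policy_loss(log_probs, advantages, w_pos):
    """Policy-gradient style loss with a separate weight on positive samples.

    Samples with advantage > 0 are scaled by w_pos; the rest keep weight 1.0
    (an assumption made for this illustration).
    """
    total = 0.0
    for log_prob, advantage in zip(log_probs, advantages):
        weight = w_pos if advantage > 0 else 1.0
        total += -weight * advantage * log_prob
    return total / len(log_probs)


if __name__ == "__main__":
    h_initial = token_entropy([0.25, 0.25, 0.25, 0.25])  # uniform start
    h_current = token_entropy([0.7, 0.1, 0.1, 0.1])      # sharper policy
    w = positive_weight(h_current, h_initial)
    print("positive-sample weight:", w)
    print("loss:", weighted_policy_loss([-1.0, -2.0], [1.0, -1.0], w))
```

Under these assumptions, a policy whose entropy has dropped well below its starting value gets a smaller weight on positive samples, while negative samples keep their full contribution. A fixed-weight baseline would instead pass a constant w_pos.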

Key facts

  • arXiv paper 2605.27846
  • EAPO: Entropy-driven Adaptive Policy Optimization
  • Focuses on open-ended question answering
  • Negative samples govern response diversity and the performance upper bound
  • Positive samples determine response quality and convergence stability
  • Adaptive weighting based on entropy ratio
  • Addresses fixed-weight limitations in RLVR
  • Published on arXiv

Entities

Institutions

  • arXiv
