ARTFEED — Contemporary Art Intelligence

AGPO: Asymmetric Group Policy Optimization for Verifiable Reasoning and Search Ads Relevance at JD

ai-technology · 2026-05-09

A new reinforcement learning method, Asymmetric Group Policy Optimization (AGPO), aims to improve reasoning in large language models (LLMs) while preserving their exploration capacity. Existing methods for reinforcement learning with verifiable rewards (RLVR) improve sampling efficiency but narrow the reasoning boundary relative to the base model. AGPO pairs a negative-dominant strategy, which suppresses incorrect reasoning paths, with a group advantage mechanism for positive updates that focuses on rare correct paths. The paper is available on arXiv under ID 2605.05826 and describes an application to search ads relevance at JD.com.
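The asymmetry described above can be made concrete with a small sketch. This is an illustration only, not the paper's formula: the summary does not give AGPO's actual objective, so the function name, the fixed penalty of -1.0 for incorrect samples, and the `positive_scale` parameter are all assumptions. The sketch assumes binary verifiable rewards (1 for a verified-correct response, 0 otherwise) and a group of responses sampled for the same prompt.

```python
from typing import List


def asymmetric_group_advantages(
    rewards: List[int], positive_scale: float = 0.5
) -> List[float]:
    """Hypothetical per-sample advantages for one group of responses.

    Assumptions (not taken from the paper):
      - incorrect samples always get a fixed penalty of -1.0, so negative
        updates dominate;
      - correct samples get a smaller, group-relative bonus that grows as
        correct answers become rarer in the group.
    """
    if not rewards:
        return []

    # Fraction of the group that the verifier marked correct.
    success_rate = sum(rewards) / len(rewards)

    advantages = []
    for reward in rewards:
        if reward == 1:
            # Group-relative positive signal: zero when every sample is
            # correct, largest when this sample is the only correct one.
            advantages.append(positive_scale * (1.0 - success_rate))
        else:
            # Full-strength negative signal, independent of the group.
            advantages.append(-1.0)
    return advantages


if __name__ == "__main__":
    # One correct response out of four sampled for the same prompt.
    print(asymmetric_group_advantages([1, 0, 0, 0]))
```

In a policy-gradient update these values would multiply each response's log-probability term. Under the stated assumptions, a group where every response is already correct contributes no positive push, which is one way to read "focusing on rare correct paths".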

Key facts

  • arXiv ID: 2605.05826
  • RLVR methods improve sampling efficiency but narrow reasoning boundaries
  • AGPO uses negative-dominant reinforcement to suppress incorrect paths
  • AGPO uses a group advantage mechanism for positive updates
  • AGPO maintains the base model's exploration capacity
  • Application at JD.com for search ads relevance
  • Published on arXiv

Entities

Institutions

  • JD.com
  • arXiv

Sources