CUA-Gym: Scalable RL Training for Computer-Use Agents
A team of researchers has introduced CUA-Gym, a scalable framework that co-generates task instructions, environment states, and reward functions for reinforcement learning with verifiable rewards (RLVR) in computer-use agents (CUAs). The framework uses a Generator agent to construct initial and golden (target) environment states, a Discriminator agent to write reward functions from task specifications, and an orchestrator agent to drive iterative rounds of execution. The work targets the scarcity of scalable training data with deterministic rewards for CUAs: hand-curated benchmarks offer high reward fidelity but are limited in scope, while LLM-as-judge datasets scale broadly but lack reliable verification. CUA-Gym aims to combine high reward fidelity with broad scalability. The paper is available on arXiv (2605.25624).
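To make the division of roles concrete, the sketch below shows one way a generator, a discriminator, and an orchestrator could fit together. It is an illustration only, not the paper's implementation: the names `generator`, `discriminator`, `orchestrate`, `Task`, and `State`, the flat dictionary used as environment state, the structured `spec`, and the validation rule (the golden state must score 1.0 and the initial state 0.0) are all assumptions introduced here. In the actual framework these roles are played by agents rather than hard-coded functions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

# Hypothetical: environment state modeled as a flat filename -> content mapping.
State = Dict[str, str]


@dataclass
class Task:
    instruction: str
    initial_state: State
    golden_state: State
    reward_fn: Callable[[State], float]


def generator(instruction: str) -> Tuple[State, State]:
    """Stand-in for the Generator agent: returns (initial, golden) states."""
    initial = {"report.txt": "draft"}
    golden = {"report.txt": "draft", "report_backup.txt": "draft"}
    return initial, golden


def discriminator(spec: Dict[str, str]) -> Callable[[State], float]:
    """Stand-in for the Discriminator agent: turns a task spec into a
    deterministic reward check (1.0 if every required entry matches)."""
    def reward(final_state: State) -> float:
        ok = all(final_state.get(key) == value for key, value in spec.items())
        return 1.0 if ok else 0.0
    return reward


def orchestrate(instruction: str, spec: Dict[str, str], max_rounds: int = 3) -> Task:
    """Stand-in for the orchestrator: retry until states and reward agree."""
    for _ in range(max_rounds):
        initial, golden = generator(instruction)
        reward_fn = discriminator(spec)
        # Assumed validation rule: the golden state must be rewarded,
        # and the untouched initial state must not be.
        if reward_fn(golden) == 1.0 and reward_fn(initial) == 0.0:
            return Task(instruction, initial, golden, reward_fn)
    raise RuntimeError("no consistent task/reward pair found")


task = orchestrate(
    "Copy report.txt to report_backup.txt",
    spec={"report_backup.txt": "draft"},  # hypothetical structured task spec
)
```

Because the reward is a plain function of the final environment state, the same rollout always receives the same score, which is the property that distinguishes a verifiable reward from an LLM-as-judge score.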
Key facts
- CUA-Gym is a scalable pipeline for generating RLVR training data for CUAs.
- It co-generates task instructions, environment states, and reward functions.
- A Generator agent constructs initial and golden environment states.
- A Discriminator agent writes reward functions from task specifications.
- An orchestrator agent drives iterative rounds of execution.
- The approach addresses the scarcity of scalable training data with deterministic rewards.
- Hand-curated benchmarks offer high reward fidelity but cover only a limited range of applications.
- LLM-as-judge datasets scale broadly but lack reliable verification.
- The paper is available on arXiv with ID 2605.25624.
Entities
Institutions
- arXiv