Rollout Cards: A New Standard for Reproducibility in Agent Research
A recent study published on arXiv introduces 'rollout cards' as a standard for reproducibility in agent research, addressing the problem that reported scores frequently mask the rollout records behind them. The researchers examined 50 widely used training and evaluation repositories and found that none reported failures, errors, or skips alongside their main scores. They identified 37 instances where differences in reporting rules alone could significantly alter task-success rates, cost/token calculations, or timing metrics, even though the underlying evidence stayed fixed. The authors contend that rollout records, not reported scores, should be the basis for reproducibility. Rollout cards are publication bundles that preserve the rollout records associated with the scores, allowing for thorough inspection and verification.
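To make the point about reporting rules concrete, here is a minimal sketch of how one fixed set of rollout records can yield two different task-success rates. The records, the status labels, and the `success_rate` helper are hypothetical illustrations and are not taken from the paper or from any audited repository.

```python
from collections import Counter

# Hypothetical rollout records: one status label per attempted task.
# These counts are invented for illustration only.
ROLLOUTS = ["success"] * 6 + ["failure"] * 2 + ["error"] * 1 + ["skip"] * 1


def success_rate(statuses, excluded=()):
    """Return the fraction of successes after dropping statuses in `excluded`."""
    counted = [s for s in statuses if s not in excluded]
    if not counted:
        return 0.0
    return Counter(counted)["success"] / len(counted)


# Rule A: every rollout counts, so errors and skips lower the score.
strict = success_rate(ROLLOUTS)

# Rule B: errors and skips are silently dropped from the denominator.
lenient = success_rate(ROLLOUTS, excluded=("error", "skip"))

print(f"strict={strict:.2f} lenient={lenient:.2f}")
# prints: strict=0.60 lenient=0.75
```

Both numbers come from the same ten records. Only the denominator changes, which is why a score published without its rollout records cannot be checked against the rule that produced it.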
Key facts
- Paper on arXiv proposes 'rollout cards' for agent research reproducibility.
- Audit of 50 widely used training and evaluation repositories found that none report failures, errors, or skips alongside their main scores.
- 37 cases documented where reporting rules can significantly change task-success, cost/token, or timing metrics on fixed evidence.
- Rollout cards preserve rollout records as the unit of reproducibility.
Entities
Institutions
- arXiv