ClawForge: A Benchmark Framework for Command-Line Agent Workflows
ClawForge is a generator-backed benchmark framework for evaluating interactive command-line agents in realistic scenarios involving state conflict. Whereas traditional benchmarks start each task from a pristine state, ClawForge tests how agents handle pre-existing, partial, stale, or conflicting artifacts. The framework combines scenario templates, grounded slots, initialized state, reference trajectories, and validators to produce reproducible task specifications. Agents are evaluated incrementally across persistent workflow surfaces, scoring normalized end states and observable side effects rather than exact trajectory matches. The paper is on arXiv under identifier 2605.14133.
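To make the composition concrete, the five components named above can be sketched as a single task-specification record. This is a minimal illustrative sketch, not ClawForge's actual API: the class name, field names, and slot syntax are all assumptions based on the summary.

```python
from dataclasses import dataclass, field

@dataclass
class TaskSpec:
    # Hypothetical ClawForge-style task specification; field names are
    # assumptions drawn from the components listed in the summary.
    template: str                    # scenario template with slot placeholders
    slots: dict                      # grounded slot values filled by a generator
    initial_state: dict              # pre-existing (possibly stale) artifacts
    reference_trajectory: list       # one known-good command sequence
    validators: list = field(default_factory=list)  # end-state checks

    def instantiate(self) -> str:
        """Render the scenario template with its grounded slot values."""
        task = self.template
        for name, value in self.slots.items():
            task = task.replace("{" + name + "}", value)
        return task

spec = TaskSpec(
    template="Rotate the log file at {log_path} without losing entries.",
    slots={"log_path": "/var/log/app.log"},
    initial_state={
        "/var/log/app.log": "current entries",
        "/var/log/app.log.1": "stale partial rotation",  # the conflicting artifact
    },
    reference_trajectory=[
        "mv /var/log/app.log /var/log/app.log.1",
        "touch /var/log/app.log",
    ],
)
print(spec.instantiate())
# → Rotate the log file at /var/log/app.log without losing entries.
```

The initialized state deliberately includes a leftover artifact (`app.log.1`), so the agent must reconcile it rather than assume a clean workspace.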
Key facts
- ClawForge is a benchmark framework for command-line agents.
- It focuses on executable workflows under state conflict.
- Existing benchmarks initialize tasks from a clean state.
- ClawForge tests handling of pre-existing, partial, stale, or conflicting artifacts.
- The framework uses scenario templates, grounded slots, initialized state, reference trajectories, and validators.
- Evaluation uses normalized end state and observable side effects.
- The paper is on arXiv: 2605.14133.
- It addresses tension between scalable construction and realistic workflow evaluation.
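Evaluation by normalized end state, as described above, can be sketched as comparing canonicalized workspace snapshots rather than command sequences. This is an illustrative sketch only: the normalization rules here (trimming trailing whitespace, ignoring timestamp-prefixed log lines) are assumptions, not ClawForge's actual validators.

```python
import re

def normalize(state: dict) -> dict:
    """Canonicalize file contents so equivalent end states compare equal.
    Hypothetical rules: strip trailing whitespace and drop timestamp-prefixed
    lines, which vary across runs without changing the outcome."""
    out = {}
    for path, content in state.items():
        lines = [ln.rstrip() for ln in content.splitlines()]
        lines = [ln for ln in lines if not re.match(r"\d{4}-\d{2}-\d{2}", ln)]
        out[path] = "\n".join(lines)
    return out

def end_states_match(agent_state: dict, reference_state: dict) -> bool:
    """Score on normalized end state, not on trajectory alignment."""
    return normalize(agent_state) == normalize(reference_state)

# Two different trajectories can yield the same normalized end state.
agent = {"out.txt": "result   \n2024-01-01 run log\n"}
ref = {"out.txt": "result\n"}
print(end_states_match(agent, ref))
# → True
```

Comparing normalized end states (plus observable side effects) lets the benchmark accept any trajectory that reaches an equivalent outcome, which is what makes incremental evaluation over persistent workflow surfaces tractable.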
Entities
Institutions
- arXiv