ARTFEED — Contemporary Art Intelligence

Claw-Eval-Live: A Live Benchmark for Evolving Workflow Agents

other · 2026-05-01

Researchers have launched Claw-Eval-Live, a live benchmark for assessing LLM agents on evolving real-world workflows. Unlike traditional static benchmarks, which freeze their task sets at release, Claw-Eval-Live separates a refreshable signal layer from a reproducible release snapshot. The signal layer is updated at each release from public workflow-demand signals, including the ClawHub Top-500 skills. Each release then materializes tasks with fixed fixtures, services, workspaces, and graders. For grading, the benchmark records execution traces, audit logs, service state, and post-run workspace artifacts, applying deterministic checks whenever the captured evidence is sufficient. The aim is to measure how well agents complete end-to-end units of work across software tools and business services while adapting to shifting workflow demands.
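To make the grading idea concrete, here is a minimal sketch of what a deterministic check over captured evidence might look like. This is an illustration only, not the benchmark's actual implementation: the `RunEvidence` schema, field names, and `deterministic_grade` function are all hypothetical, invented to mirror the four evidence sources the article describes (traces, audit logs, service state, workspace artifacts).

```python
from dataclasses import dataclass


@dataclass
class RunEvidence:
    """Hypothetical evidence bundle captured after an agent run."""
    execution_trace: list[str]          # ordered tool/API calls the agent made
    audit_log: list[str]                # events recorded by the services
    service_state: dict[str, str]       # final key -> value state of each service
    workspace_artifacts: dict[str, str] # file path -> file content after the run


def deterministic_grade(evidence: RunEvidence,
                        expected_artifacts: dict[str, str],
                        expected_state: dict[str, str]) -> bool:
    """Pass only if every expected artifact and service-state entry matches exactly.

    A deterministic check like this is only applicable when the expected
    outputs are fully specified; fuzzier outcomes would need other graders.
    """
    for path, content in expected_artifacts.items():
        if evidence.workspace_artifacts.get(path) != content:
            return False
    for key, value in expected_state.items():
        if evidence.service_state.get(key) != value:
            return False
    return True


if __name__ == "__main__":
    evidence = RunEvidence(
        execution_trace=["open_ticket", "write_report"],
        audit_log=["ticket #12 created"],
        service_state={"ticket_12.status": "closed"},
        workspace_artifacts={"report.txt": "done"},
    )
    print(deterministic_grade(evidence,
                              expected_artifacts={"report.txt": "done"},
                              expected_state={"ticket_12.status": "closed"}))
```

The design point is that grading consumes only post-run evidence, so a task passes or fails identically on every re-run of the same snapshot.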

Key facts

  • Claw-Eval-Live is a live benchmark for workflow agents.
  • It separates a refreshable signal layer from a reproducible release snapshot.
  • The signal layer is updated from public workflow-demand signals.
  • ClawHub Top-500 skills are used in the current release.
  • Tasks are materialized with fixed fixtures, services, workspaces, and graders.
  • Grading records execution traces, audit logs, service state, and post-run workspace artifacts.
  • Deterministic checks are used when evidence is sufficient.
  • The benchmark evaluates LLM agents on end-to-end units of work.

Entities

Institutions

  • arXiv

Sources