ARTFEED — Contemporary Art Intelligence

AcademiClaw Benchmark Tests AI on Real Student Tasks

other · 2026-05-06

A new benchmark called AcademiClaw evaluates AI agents on complex academic tasks sourced from university students. Unlike existing OpenClaw benchmarks, which focus on assistant-level tasks, AcademiClaw targets academic-level capabilities. The benchmark comprises 80 bilingual tasks, curated through rigorous expert review from 230 student submissions, and spans over 25 professional domains including olympiad-level mathematics, linguistics, GPU-intensive reinforcement learning, and full-stack system debugging; sixteen of the tasks require CUDA GPU execution. The tasks originate in real student workflows (homework, research projects, competitions, and personal projects) that current AI agents struggle to solve. Each task runs in an isolated Docker sandbox and is scored against multi-dimensional rubrics combining six complementary techniques. The work is published on arXiv under identifier 2605.02661.
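
To illustrate the isolation model, a harness for a benchmark like this might launch each task in its own container along the following lines. This is a minimal sketch, assuming a Docker-based setup; the image name, resource limits, and entry point are hypothetical and not taken from the paper.

    import subprocess

    def run_task_in_sandbox(task_dir: str, needs_gpu: bool) -> str:
        """Run one benchmark task in an isolated Docker container (hypothetical harness)."""
        cmd = [
            "docker", "run", "--rm",
            "--network", "none",               # no outside network access during the run
            "-v", f"{task_dir}:/task:ro",      # mount the task files read-only
            "--memory", "8g",                  # illustrative per-task resource cap
        ]
        if needs_gpu:
            cmd += ["--gpus", "all"]           # CUDA tasks need GPU passthrough
        cmd += [
            "academiclaw/sandbox:latest",      # hypothetical sandbox image
            "python", "/task/solve.py",        # hypothetical task entry point
        ]
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=3600)
        return result.stdout

A per-task container with networking disabled and the task mounted read-only is one common way to keep agent runs reproducible and mutually isolated; the paper's actual sandbox configuration may differ.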

Key facts

  • AcademiClaw is a bilingual benchmark of 80 complex tasks.
  • Tasks come from real university student academic workflows.
  • 230 student-submitted candidates were curated through expert review.
  • Tasks span 25+ professional domains.
  • 16 tasks require CUDA GPU execution.
  • Each task runs in an isolated Docker sandbox.
  • Scoring uses multi-dimensional rubrics with six complementary techniques (see the sketch after this list).
  • Published on arXiv with ID 2605.02661.
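
The article does not detail the six scoring techniques, but a multi-dimensional rubric typically reduces to a weighted aggregate of per-dimension scores, roughly as sketched below. The dimension names and weights here are illustrative assumptions, not values from the benchmark.

    def rubric_score(scores: dict[str, float], weights: dict[str, float]) -> float:
        """Aggregate per-dimension rubric scores (each in [0, 1]) into one weighted total.

        The dimensions and weights are illustrative; AcademiClaw's actual rubric
        structure and scoring techniques are not specified in this article.
        """
        total_weight = sum(weights.values())
        return sum(scores[d] * weights[d] for d in weights) / total_weight

    # Example: three hypothetical dimensions for one task
    print(rubric_score(
        {"correctness": 0.9, "completeness": 0.7, "clarity": 0.8},
        {"correctness": 0.5, "completeness": 0.3, "clarity": 0.2},
    ))  # 0.82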

Entities

Institutions

  • OpenClaw
  • AcademiClaw
  • arXiv
