AgentFloor Benchmark Tests Small Open-Weight Models on Tool Use
AgentFloor, a newly introduced benchmark, measures how well small open-weight language models can use tools. It consists of 30 deterministic tasks organized into a six-tier capability ladder spanning instruction following, tool use, multi-step coordination, and long-horizon planning under sustained constraints. Researchers evaluated 16 open-weight models ranging from 0.27B to 32B parameters, alongside GPT-5, across 16,542 scored runs. The results show a clear threshold: small and mid-sized open-weight models are sufficient for much of the short-horizon, structured tool use that dominates real agent workflows, and the strongest open-weight model matches GPT-5 in aggregate on the benchmark.
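The article does not describe AgentFloor's task format, so the following is only a minimal sketch of what a deterministic, pass/fail tool-use task and scorer could look like. The `ToolUseTask` and `score_run` names, the tool names, and the tier label are all hypothetical and are not drawn from the benchmark itself.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ToolUseTask:
    """Hypothetical deterministic task: a fixed prompt, the tools the model
    may call, and the exact expected sequence of tool calls for scoring."""
    tier: str
    prompt: str
    tools: dict[str, Callable[..., str]]
    expected_calls: list[tuple[str, dict]]


def score_run(task: ToolUseTask, observed_calls: list[tuple[str, dict]]) -> bool:
    """Pass/fail scoring: the run passes only if the model issued exactly the
    expected tool calls, in order, with matching arguments."""
    return observed_calls == task.expected_calls


# Illustrative "tool use"-tier task: look up a value, then convert its units.
task = ToolUseTask(
    tier="tool_use",
    prompt="Look up the boiling point of water in Fahrenheit and convert it to Celsius.",
    tools={
        "lookup": lambda query: "212",
        "convert_f_to_c": lambda value: str((float(value) - 32) * 5 / 9),
    },
    expected_calls=[
        ("lookup", {"query": "boiling point of water in Fahrenheit"}),
        ("convert_f_to_c", {"value": "212"}),
    ],
)

# A model transcript parsed into (tool_name, arguments) pairs.
observed = [
    ("lookup", {"query": "boiling point of water in Fahrenheit"}),
    ("convert_f_to_c", {"value": "212"}),
]

print(score_run(task, observed))  # True
```

Exact-match scoring like this keeps each run fully reproducible; a real harness for a 30-task, six-tier ladder would likely relax matching per tier, but that detail is not given in the source.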
Key facts
- AgentFloor is a deterministic 30-task benchmark.
- Benchmark organized as a six-tier capability ladder.
- Tasks include instruction following, tool use, multi-step coordination, and long-horizon planning.
- Evaluated 16 open-weight models from 0.27B to 32B parameters.
- GPT-5 was also evaluated for comparison.
- Total of 16,542 scored runs were conducted.
- Small and mid-sized open-weight models are sufficient for short-horizon structured tool use.
- Strongest open-weight model matches GPT-5 in aggregate on the benchmark.