AgentFloor Benchmark Tests Small Open-Weight Models on Tool Use
AgentFloor, a newly introduced benchmark, measures how well small open-weight language models can use tools. It consists of 30 deterministic tasks organized into a six-tier capability ladder spanning instruction following, tool use, multi-step coordination, and long-horizon planning under sustained constraints. Researchers evaluated 16 open-weight models ranging from 0.27B to 32B parameters, alongside GPT-5, across 16,542 scored runs. The results show a clear threshold: small and mid-sized open-weight models are sufficient for much of the short-horizon, structured tool use that dominates real agent workflows, and the strongest open-weight model matches GPT-5 in aggregate on the benchmark.
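The article does not describe AgentFloor's task format, so the following is only a minimal sketch of what a deterministic, pass/fail tool-use task and scorer could look like. The `ToolUseTask` and `score_run` names, the tool names, and the tier label are all hypothetical and are not drawn from the benchmark itself.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ToolUseTask:
    """Hypothetical deterministic task: a fixed prompt, the tools the model
    may call, and the exact expected sequence of tool calls for scoring."""
    tier: str
    prompt: str
    tools: dict[str, Callable[..., str]]
    expected_calls: list[tuple[str, dict]]


def score_run(task: ToolUseTask, observed_calls: list[tuple[str, dict]]) -> bool:
    """Pass/fail scoring: the run passes only if the model issued exactly the
    expected tool calls, in order, with matching arguments."""
    return observed_calls == task.expected_calls


# Illustrative "tool use"-tier task: look up a value, then convert its units.
task = ToolUseTask(
    tier="tool_use",
    prompt="Look up the boiling point of water in Fahrenheit and convert it to Celsius.",
    tools={
        "lookup": lambda query: "212",
        "convert_f_to_c": lambda value: str((float(value) - 32) * 5 / 9),
    },
    expected_calls=[
        ("lookup", {"query": "boiling point of water in Fahrenheit"}),
        ("convert_f_to_c", {"value": "212"}),
    ],
)

# A model transcript parsed into (tool_name, arguments) pairs.
observed = [
    ("lookup", {"query": "boiling point of water in Fahrenheit"}),
    ("convert_f_to_c", {"value": "212"}),
]

print(score_run(task, observed))  # True
```

Exact-match scoring like this keeps each run fully reproducible; a real harness for a 30-task, six-tier ladder would likely relax matching per tier, but that detail is not given in the source.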
Key facts
- AgentFloor is a deterministic 30-task benchmark.
- Benchmark organized as a six-tier capability ladder.
- Tasks include instruction following, tool use, multi-step coordination, and long-horizon planning.
- Evaluated 16 open-weight models from 0.27B to 32B parameters.
- GPT-5 was also evaluated for comparison.
- Total of 16,542 scored runs were conducted.
- Small and mid-sized open-weight models are sufficient for short-horizon structured tool use.
- Strongest open-weight model matches GPT-5 in aggregate on the benchmark.