ARTFEED — Contemporary Art Intelligence

Study Evaluates 22 Agentic AI Frameworks on Reasoning Benchmarks

ai-technology · 2026-04-22

A recent empirical study evaluated 22 popular agentic AI frameworks on three reasoning benchmarks: BBH, GSM8K, and ARC. The frameworks were selected from a pool of 1,200 GitHub repositories collected between January 2023 and July 2025 and categorized by architectural design. In a standardized evaluation environment, the researchers measured reasoning accuracy, execution time, computational cost, and consistency across benchmarks.

Of the 22 frameworks, 19 completed all three benchmarks; 12 showed stable performance, with mean accuracy between 74.6% and 75.9%. Execution times ranged from 4 to 6 seconds per task, and computational costs from 0.14 to 0.18 cents per task. The study identified orchestration issues as a leading cause of poor performance, and it fills a notable gap in the comparative analysis of reasoning efficiency and practicality across agentic frameworks. Highlighting recent advances in AI agents' capacity for complex reasoning and decision-making, the study was published as arXiv:2604.16646v1.

Key facts

  • Study evaluated 22 agentic AI frameworks
  • Frameworks tested on BBH, GSM8K, and ARC benchmarks
  • 12 frameworks showed stable 74.6-75.9% mean accuracy
  • Execution time ranged 4-6 seconds per task
  • Computational cost ranged 0.14-0.18 cents per task
  • Frameworks selected from 1,200 GitHub repositories
  • Data collected January 2023-July 2025
  • 19 of 22 frameworks completed all three benchmarks
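To make the study's metrics concrete, here is a minimal sketch (not the paper's actual harness) of how per-benchmark results for one framework could be aggregated into the figures reported above: mean accuracy across BBH, GSM8K, and ARC, cross-benchmark consistency (standard deviation), and mean latency and cost per task. All function names and numbers below are illustrative placeholders, not the study's data.

```python
# Hypothetical aggregation of one framework's per-benchmark results,
# mirroring the metrics the study reports (accuracy, consistency,
# seconds per task, cents per task). Values are made-up placeholders.
from statistics import mean, stdev

def summarize(results):
    """results: list of (benchmark, accuracy, seconds_per_task, cents_per_task)."""
    accs = [r[1] for r in results]
    return {
        "mean_accuracy": round(mean(accs), 3),        # average over benchmarks
        "consistency_stdev": round(stdev(accs), 3),   # lower = more stable
        "mean_seconds": round(mean(r[2] for r in results), 2),
        "mean_cents": round(mean(r[3] for r in results), 3),
    }

# Placeholder numbers chosen to fall inside the ranges the article cites.
placeholder = [
    ("BBH",   0.750, 5.1, 0.16),
    ("GSM8K", 0.748, 4.4, 0.15),
    ("ARC",   0.755, 5.8, 0.17),
]
print(summarize(placeholder))
```

A framework counting as "stable" in the study's sense would show a small cross-benchmark spread, which is what the standard-deviation field captures here.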

Entities

Institutions

  • GitHub
  • arXiv