ARTFEED — Contemporary Art Intelligence

Study Evaluates 22 Agentic AI Frameworks on Reasoning Benchmarks

ai-technology · 2026-04-22

A recent empirical study evaluated 22 popular agentic AI frameworks on three reasoning benchmarks: BBH, GSM8K, and ARC. The frameworks were selected from a pool of 1,200 GitHub repositories collected between January 2023 and July 2025 and categorized by architectural design. In a standardized evaluation environment, the researchers measured reasoning accuracy, execution time, computational cost, and consistency across benchmarks.

Of the 22 frameworks, 19 completed all three benchmarks; 12 showed stable performance, with mean accuracy between 74.6% and 75.9%. Execution times ranged from 4 to 6 seconds per task, and computational costs from 0.14 to 0.18 cents per task. The study identified orchestration issues as a leading cause of poor performance, and it fills a notable gap in the comparative analysis of reasoning efficiency and practicality across agentic frameworks. Highlighting recent advances in AI agents' capacity for complex reasoning and decision-making, the study was published as arXiv:2604.16646v1.

Key facts

  • Study evaluated 22 agentic AI frameworks
  • Frameworks tested on BBH, GSM8K, and ARC benchmarks
  • 12 frameworks showed stable 74.6-75.9% mean accuracy
  • Execution time ranged 4-6 seconds per task
  • Computational cost ranged 0.14-0.18 cents per task
  • Frameworks selected from 1,200 GitHub repositories
  • Data collected January 2023-July 2025
  • 19 of 22 frameworks completed all three benchmarks
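To make the study's metrics concrete, here is a minimal sketch (not the paper's actual harness) of how per-benchmark results for one framework could be aggregated into the figures reported above: mean accuracy across BBH, GSM8K, and ARC, cross-benchmark consistency (standard deviation), and mean latency and cost per task. All function names and numbers below are illustrative placeholders, not the study's data.

```python
# Hypothetical aggregation of one framework's per-benchmark results,
# mirroring the metrics the study reports (accuracy, consistency,
# seconds per task, cents per task). Values are made-up placeholders.
from statistics import mean, stdev

def summarize(results):
    """results: list of (benchmark, accuracy, seconds_per_task, cents_per_task)."""
    accs = [r[1] for r in results]
    return {
        "mean_accuracy": round(mean(accs), 3),        # average over benchmarks
        "consistency_stdev": round(stdev(accs), 3),   # lower = more stable
        "mean_seconds": round(mean(r[2] for r in results), 2),
        "mean_cents": round(mean(r[3] for r in results), 3),
    }

# Placeholder numbers chosen to fall inside the ranges the article cites.
placeholder = [
    ("BBH",   0.750, 5.1, 0.16),
    ("GSM8K", 0.748, 4.4, 0.15),
    ("ARC",   0.755, 5.8, 0.17),
]
print(summarize(placeholder))
```

A framework counting as "stable" in the study's sense would show a small cross-benchmark spread, which is what the standard-deviation field captures here.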

Entities

Institutions

  • GitHub
  • arXiv