ARTFEED — Contemporary Art Intelligence

New Framework Quantifies Divergence in LLM API Reasoning

ai-technology · 2026-04-29

A new benchmarking framework quantifies how consistently large language models identify and rank APIs for the same task. The researchers evaluated 15 canonical API domains across 5 major model families, using agreement metrics including Average Overlap, Jaccard similarity, Rank-Biased Overlap, Kendall's tau, Kendall's W, and Cronbach's alpha. Overall agreement was moderate (Average Overlap around 0.50, Kendall's tau close to 0.45), but it varied sharply by domain: structured tasks such as Weather and Speech-to-Text produced stable rankings across models, while open-ended tasks such as Sentiment Analysis diverged considerably. The research is available as paper 2604.22760 on arXiv.
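The paper's exact formulation isn't reproduced here, but Average Overlap is a standard prefix-based agreement measure: at each depth d, compute the fraction of items shared by the two top-d lists, then average over depths. A minimal sketch, with hypothetical API names for illustration:

```python
def average_overlap(a, b, k=None):
    """Average Overlap: mean shared fraction of the top-d prefixes of two rankings."""
    k = k or min(len(a), len(b))
    total = 0.0
    for d in range(1, k + 1):
        total += len(set(a[:d]) & set(b[:d])) / d
    return total / k

# Hypothetical rankings of Weather APIs from two different models
rank_model_1 = ["openweathermap", "weatherapi", "tomorrow.io", "visualcrossing"]
rank_model_2 = ["weatherapi", "openweathermap", "tomorrow.io", "meteostat"]
print(round(average_overlap(rank_model_1, rank_model_2), 3))
```

Because every prefix depth contributes equally, AO rewards models that agree near the top of the list even when the tails differ, which suits API-discovery rankings where only the first few suggestions matter.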

Key facts

  • Framework quantifies inter-LLM divergence in API discovery and ranking
  • 15 canonical API domains tested
  • 5 major model families evaluated
  • Metrics include Average Overlap, Jaccard similarity, Rank-Biased Overlap, Kendall's tau, Kendall's W, Cronbach's alpha
  • Overall agreement moderate: AO ~0.50, tau ~0.45
  • Structured tasks (Weather, Speech-to-Text) show stability
  • Open-ended tasks (Sentiment Analysis) show higher divergence
  • Published on arXiv as paper 2604.22760
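Two of the listed metrics are easy to state concretely. Kendall's tau compares how two models order the same set of APIs (concordant minus discordant pairs, normalized), and Jaccard similarity compares which APIs each model surfaces at all, ignoring order. A sketch of both, assuming complete rankings over a shared item set (the paper may use tie-handling variants):

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall's tau-a between two rankings of the same item set (no ties)."""
    pos_a = {item: i for i, item in enumerate(rank_a)}
    pos_b = {item: i for i, item in enumerate(rank_b)}
    concordant = discordant = 0
    for x, y in combinations(pos_a, 2):
        # Pair is concordant if both rankings order x and y the same way
        s = (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n = len(pos_a)
    return (concordant - discordant) / (n * (n - 1) / 2)

def jaccard(a, b):
    """Jaccard similarity between the sets of APIs two models returned."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)
```

On identical rankings tau is 1.0 and on fully reversed ones it is -1.0, so the reported tau of about 0.45 corresponds to clearly better-than-chance but far from unanimous ordering across models.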

Entities

Institutions

  • arXiv

Sources