New Framework Quantifies Divergence in LLM API Reasoning
A new benchmarking framework quantifies how consistently large language models identify and rank APIs for the same task. The researchers evaluated 5 major model families across 15 canonical API domains, measuring agreement with metrics such as Average Overlap, Jaccard similarity, Rank-Biased Overlap, Kendall's tau, Kendall's W, and Cronbach's alpha. Overall agreement was moderate, with an Average Overlap of roughly 0.50 and a Kendall's tau near 0.45, but it varied substantially by domain: structured tasks such as Weather and Speech-to-Text produced more consistent rankings, while open-ended tasks such as Sentiment Analysis diverged more. The work is published on arXiv as paper 2604.22760.
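To make the set-overlap metrics concrete, here is a minimal sketch using standard definitions of Jaccard similarity and Average Overlap; the API names and rankings are hypothetical illustrations, not data or code from the paper.

```python
# Sketch of two agreement metrics under standard definitions,
# applied to made-up API rankings (not results from the study).

def jaccard(a, b):
    """Jaccard similarity between two API sets: |A intersect B| / |A union B|."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if (sa or sb) else 1.0

def average_overlap(a, b, k=None):
    """Average Overlap: mean overlap fraction of the top-d prefixes for d = 1..k,
    so agreement at early ranks is weighted more heavily."""
    k = k or min(len(a), len(b))
    return sum(len(set(a[:d]) & set(b[:d])) / d for d in range(1, k + 1)) / k

# Hypothetical top-5 rankings from two models for a "Weather" query.
model_a = ["OpenWeatherMap", "WeatherAPI", "Tomorrow.io", "Visual Crossing", "Weatherbit"]
model_b = ["WeatherAPI", "OpenWeatherMap", "Weatherbit", "Tomorrow.io", "Open-Meteo"]

print(f"Jaccard:         {jaccard(model_a, model_b):.2f}")
print(f"Average Overlap: {average_overlap(model_a, model_b):.2f}")
```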
Key facts
- Framework quantifies inter-LLM divergence in API discovery and ranking
- 15 canonical API domains tested
- 5 major model families evaluated
- Metrics include Average Overlap, Jaccard similarity, Rank-Biased Overlap, Kendall's tau, Kendall's W, and Cronbach's alpha (a rank-agreement sketch follows this list)
- Overall agreement moderate: AO ~0.50, tau ~0.45
- Structured tasks (Weather, Speech-to-Text) show stability
- Open-ended tasks (Sentiment Analysis) show higher divergence
- Published on arXiv as paper 2604.22760
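For the rank-agreement side, the sketch below computes Kendall's tau between two models and Kendall's W across several models using their textbook formulations; the rank matrix is invented for illustration and does not reflect the study's data.

```python
# Sketch of pairwise Kendall's tau and Kendall's W (coefficient of concordance)
# on made-up ranks. Assumes every model ranked the same candidate APIs.
import numpy as np
from scipy.stats import kendalltau

# Rows: models; columns: rank each model assigned to APIs api_0..api_5.
ranks = np.array([
    [1, 2, 3, 4, 5, 6],
    [2, 1, 3, 5, 4, 6],
    [1, 3, 2, 4, 6, 5],
])

# Pairwise Kendall's tau (here: models 0 and 1).
tau, _ = kendalltau(ranks[0], ranks[1])
print(f"Kendall's tau (models 0 vs 1): {tau:.2f}")

# Kendall's W = 12*S / (m^2 * (n^3 - n)), where S is the sum of squared
# deviations of per-API rank sums from their mean, m = #models, n = #APIs.
m, n = ranks.shape
rank_sums = ranks.sum(axis=0)
S = ((rank_sums - rank_sums.mean()) ** 2).sum()
W = 12 * S / (m**2 * (n**3 - n))
print(f"Kendall's W across {m} models: {W:.2f}")
```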