New Framework Quantifies Divergence in LLM API Reasoning
A new benchmarking framework quantifies how consistently large language models identify and rank APIs for the same task. The researchers evaluated 5 major model families across 15 canonical API domains, measuring agreement with metrics such as Average Overlap, Jaccard similarity, Rank-Biased Overlap, Kendall's tau, Kendall's W, and Cronbach's alpha. Overall agreement was moderate, with an Average Overlap of roughly 0.50 and a Kendall's tau near 0.45, but it varied substantially by domain: structured tasks such as Weather and Speech-to-Text produced more consistent rankings, while open-ended tasks such as Sentiment Analysis diverged more. The work is published on arXiv as paper 2604.22760.
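To make the set-overlap metrics concrete, here is a minimal sketch using standard definitions of Jaccard similarity and Average Overlap; the API names and rankings are hypothetical illustrations, not data or code from the paper.

```python
# Sketch of two agreement metrics under standard definitions,
# applied to made-up API rankings (not results from the study).

def jaccard(a, b):
    """Jaccard similarity between two API sets: |A intersect B| / |A union B|."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if (sa or sb) else 1.0

def average_overlap(a, b, k=None):
    """Average Overlap: mean overlap fraction of the top-d prefixes for d = 1..k,
    so agreement at early ranks is weighted more heavily."""
    k = k or min(len(a), len(b))
    return sum(len(set(a[:d]) & set(b[:d])) / d for d in range(1, k + 1)) / k

# Hypothetical top-5 rankings from two models for a "Weather" query.
model_a = ["OpenWeatherMap", "WeatherAPI", "Tomorrow.io", "Visual Crossing", "Weatherbit"]
model_b = ["WeatherAPI", "OpenWeatherMap", "Weatherbit", "Tomorrow.io", "Open-Meteo"]

print(f"Jaccard:         {jaccard(model_a, model_b):.2f}")
print(f"Average Overlap: {average_overlap(model_a, model_b):.2f}")
```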
Key facts
- Framework quantifies inter-LLM divergence in API discovery and ranking
- 15 canonical API domains tested
- 5 major model families evaluated
- Metrics include Average Overlap, Jaccard similarity, Rank-Biased Overlap, Kendall's tau, Kendall's W, and Cronbach's alpha (a rank-agreement sketch follows this list)
- Overall agreement moderate: AO ~0.50, tau ~0.45
- Structured tasks (Weather, Speech-to-Text) show stability
- Open-ended tasks (Sentiment Analysis) show higher divergence
- Published on arXiv as paper 2604.22760
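For the rank-agreement side, the sketch below computes Kendall's tau between two models and Kendall's W across several models using their textbook formulations; the rank matrix is invented for illustration and does not reflect the study's data.

```python
# Sketch of pairwise Kendall's tau and Kendall's W (coefficient of concordance)
# on made-up ranks. Assumes every model ranked the same candidate APIs.
import numpy as np
from scipy.stats import kendalltau

# Rows: models; columns: rank each model assigned to APIs api_0..api_5.
ranks = np.array([
    [1, 2, 3, 4, 5, 6],
    [2, 1, 3, 5, 4, 6],
    [1, 3, 2, 4, 6, 5],
])

# Pairwise Kendall's tau (here: models 0 and 1).
tau, _ = kendalltau(ranks[0], ranks[1])
print(f"Kendall's tau (models 0 vs 1): {tau:.2f}")

# Kendall's W = 12*S / (m^2 * (n^3 - n)), where S is the sum of squared
# deviations of per-API rank sums from their mean, m = #models, n = #APIs.
m, n = ranks.shape
rank_sums = ranks.sum(axis=0)
S = ((rank_sums - rank_sums.mean()) ** 2).sum()
W = 12 * S / (m**2 * (n**3 - n))
print(f"Kendall's W across {m} models: {W:.2f}")
```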