BlueFin Benchmark Tests LLM Agents on Financial Spreadsheets

ai-technology · 2026-06-01

Researchers have introduced BlueFin, a benchmark for evaluating large language model (LLM) agents on spreadsheet tasks in the finance sector. It comprises 131 complex, real-world tasks and 3,225 detailed rubric criteria. A group of expert human annotators validated both the rubric criteria and the evaluations produced by LM judges, which supports the quality of the assessments. The benchmark aims to address the limited attention given to LLM performance on spreadsheet work, even though paying spreadsheet users worldwide far outnumber professional developers.

Key facts

BlueFin is a benchmark for LLM agents on financial spreadsheet tasks.
It includes 131 tasks and 3,225 rubric criteria.
Expert human annotators validated the rubric criteria and the LM-judge evaluations.
The benchmark targets synthesis, manipulation, and comprehension tasks.
Global spreadsheet users outnumber professional developers by an order of magnitude.
Few resources have been devoted to LLM capabilities in the spreadsheet domain.
The benchmark aims to mirror real occupational tasks in professional finance.
The paper is available on arXiv with ID 2605.30907.
