TopBench Benchmark Tests LLMs on Implicit Table Reasoning
Researchers have launched TopBench, a benchmark for assessing large language models on implicit prediction and reasoning in tabular question answering. The benchmark comprises 779 samples across four sub-tasks: single-point prediction, decision making, treatment effect analysis, and complex filtering. Models must produce outputs that combine reasoning text with structured tables. Evaluations under both text-based and agentic workflows show that existing models frequently fail to recognize the predictive intent behind these queries.
Key facts
- TopBench is a benchmark for implicit prediction and reasoning over tabular question answering.
- It contains 779 samples across four sub-tasks.
- Sub-tasks include single-point prediction, decision making, treatment effect analysis, and complex filtering.
- Models must generate outputs spanning reasoning text and structured tables.
- Evaluations were conducted under text-based and agentic workflows.
- Current models often struggle with intent recognition.
- The benchmark addresses queries requiring inference from historical patterns.
- The research is published on arXiv with ID 2604.28076.
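To make the task concrete, the facts above can be sketched as a minimal data model: each sample pairs an input table and question with a gold answer spanning reasoning text and a structured table. This is a hypothetical sketch only; the field names, schema, and metric below are illustrative assumptions, not the actual TopBench format.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any

class SubTask(Enum):
    """The four sub-tasks named by the benchmark."""
    SINGLE_POINT_PREDICTION = "single_point_prediction"
    DECISION_MAKING = "decision_making"
    TREATMENT_EFFECT = "treatment_effect_analysis"
    COMPLEX_FILTERING = "complex_filtering"

@dataclass
class Sample:
    """Hypothetical record layout; the real TopBench schema may differ."""
    sub_task: SubTask
    table: list[dict[str, Any]]            # input table as row dicts
    question: str                          # implicit predictive query
    reference_reasoning: str               # gold reasoning text
    reference_table: list[dict[str, Any]]  # gold structured table

def exact_table_match(pred: list[dict], ref: list[dict]) -> bool:
    """Toy metric: order-insensitive exact match of table rows."""
    canon = lambda rows: sorted(sorted(r.items()) for r in rows)
    return canon(pred) == canon(ref)

# Illustrative sample: a single-point prediction over a sales table.
sample = Sample(
    sub_task=SubTask.SINGLE_POINT_PREDICTION,
    table=[{"month": "Jan", "sales": 100}, {"month": "Feb", "sales": 110}],
    question="Based on the trend, what are expected sales in March?",
    reference_reasoning="Sales grow by about 10 per month, so March is near 120.",
    reference_table=[{"month": "Mar", "sales": 120}],
)
print(exact_table_match(sample.reference_table, sample.reference_table))  # True
```

A real harness would additionally parse the model's free-form output into reasoning and table parts before scoring; that parsing step is where the reported intent-recognition failures would surface.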