QSTRBench: New Benchmark Tests LLMs on Spatial and Temporal Reasoning
Researchers have created a new benchmark, QSTRBench, to evaluate large language models (LLMs) on qualitative spatial and temporal reasoning (QSTR). Its questions cover compositional reasoning, converse relations, and conceptual neighborhoods. It spans several calculi: Point Algebra (PA), Allen's Interval Algebra, INDU, the Region Connection Calculi (RCC-5, RCC-8, RCC-22), the nine intersection model, the cardinal direction calculus, and STAR. The RCC-22 conceptual neighborhood is published here for the first time. The benchmark also varies how questions are presented: prefix or infix notation, relations given as words, symbols, or nonce terms, and schematic descriptions. Tests on major models show that all of them outperform random guessing but none reaches perfect accuracy. Performance varies sharply by calculus, with PA the easiest and RCC-22 the most challenging.
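To make the three question types concrete, here is a minimal sketch for Point Algebra, the simplest calculus named above. The tables are the standard PA definitions over the relations <, =, and >. The names `CONVERSE`, `COMPOSE`, `NEIGHBORS`, `compose`, and `brute_force_compose` are illustrative assumptions and are not taken from QSTRBench itself.

```python
from itertools import product

# Point Algebra: the three base relations between two points on a line.
BASE = ("<", "=", ">")
ALL = frozenset(BASE)

# Converse: if "x r y" holds, then "y CONVERSE[r] x" holds.
CONVERSE = {"<": ">", "=": "=", ">": "<"}

# Composition: given "x r1 y" and "y r2 z", the set of relations
# that can hold between x and z.
COMPOSE = {
    ("<", "<"): frozenset({"<"}),
    ("<", "="): frozenset({"<"}),
    ("<", ">"): ALL,  # nothing can be inferred
    ("=", "<"): frozenset({"<"}),
    ("=", "="): frozenset({"="}),
    ("=", ">"): frozenset({">"}),
    (">", "<"): ALL,  # nothing can be inferred
    (">", "="): frozenset({">"}),
    (">", ">"): frozenset({">"}),
}

# Conceptual neighbors: relations one can change into directly under
# continuous movement of a point, without passing through a third relation.
NEIGHBORS = {"<": {"="}, "=": {"<", ">"}, ">": {"="}}


def compose(r1, r2):
    """Look up the composition of two base relations."""
    return COMPOSE[(r1, r2)]


def brute_force_compose(r1, r2, domain=range(3)):
    """Recompute a composition by enumerating small integer points."""
    def rel(a, b):
        return "<" if a < b else "=" if a == b else ">"

    return frozenset(
        rel(x, z)
        for x, y, z in product(domain, repeat=3)
        if rel(x, y) == r1 and rel(y, z) == r2
    )


if __name__ == "__main__":
    print(sorted(compose("<", "<")))  # ['<']
    print(sorted(compose("<", ">")))  # ['<', '=', '>']
    print(CONVERSE["<"])              # >
```

A composition question asks what can hold between x and z given "x < y" and "y > z". The table answers that any of the three relations is possible. The larger calculi follow the same pattern with far bigger tables, which is consistent with the reported difficulty gap between PA and RCC-22.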
Key facts
- QSTRBench evaluates LLMs on qualitative spatial and temporal reasoning.
- Includes calculi: PA, Allen's Interval Algebra, INDU, RCC-5, RCC-8, RCC-22, nine intersection model, cardinal direction calculus, STAR.
- RCC-22 conceptual neighborhood is published for the first time.
- Question presentation varies: prefix/infix, words/symbols/nonce terms, schematic descriptions.
- All tested models outperform guessing but none achieve perfect accuracy.
- Performance varies sharply by calculus; PA is easiest, RCC-22 hardest.
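The presentation variants listed above can be illustrated with a small sketch that renders one Point Algebra statement in several surface forms. The word and nonce vocabularies, the `render` function, and its parameters are hypothetical and chosen only for illustration. They are not the benchmark's actual templates or wording.

```python
# Hypothetical vocabularies for the three Point Algebra relations.
WORDS = {"<": "precedes", "=": "equals", ">": "follows"}
NONCE = {"<": "blick", "=": "dax", ">": "wug"}  # invented nonce terms


def render(relation, x, y, vocab="symbols", notation="infix"):
    """Render "x relation y" using the chosen vocabulary and notation."""
    if vocab == "symbols":
        name = relation
    elif vocab == "words":
        name = WORDS[relation]
    elif vocab == "nonce":
        name = NONCE[relation]
    else:
        raise ValueError(f"unknown vocab: {vocab}")

    if notation == "infix":
        return f"{x} {name} {y}"
    if notation == "prefix":
        return f"{name}({x}, {y})"
    raise ValueError(f"unknown notation: {notation}")


if __name__ == "__main__":
    for vocab in ("symbols", "words", "nonce"):
        for notation in ("infix", "prefix"):
            print(render("<", "x", "y", vocab, notation))
```

Holding the underlying relation fixed while changing only the surface form is one way to separate a model's reasoning ability from its familiarity with a particular notation. Nonce terms in particular remove any hint carried by a relation's usual name.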