A2RBench: Automated Benchmark for LLM Abstract Reasoning

ai-technology · 2026-05-20

A new automated pipeline, A2RBench, generates formally verifiable benchmarks for testing abstract reasoning in large language models (LLMs). The system uses LLMs to create diverse reasoning tasks, then expands the benchmark by reusing validated rules on newly generated input spaces. To eliminate hallucinations, the pipeline applies programmatic verification through cycle consistency, checking whether an inverse operation reverses the corresponding forward operation. The approach aims to measure genuine reasoning rather than memorization, addressing two limitations of existing benchmarks: reliance on expensive manual annotation and the risk of rewarding memorized answers. The arXiv paper (2605.17278) details the generation, expansion, evaluation, and analysis stages.
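The cycle-consistency idea can be illustrated with a minimal Python sketch. This is not the paper's implementation. The helper `is_cycle_consistent` and the toy rules `rotate_left`, `rotate_right`, and `bad_inverse` are hypothetical names chosen for illustration, and the sketch assumes a rule is represented as a pair of plain Python functions.

```python
from typing import Any, Callable, Iterable


def is_cycle_consistent(
    forward: Callable[[Any], Any],
    inverse: Callable[[Any], Any],
    inputs: Iterable[Any],
) -> bool:
    """Return True only if inverse(forward(x)) == x for every sample input."""
    for x in inputs:
        try:
            if inverse(forward(x)) != x:
                return False
        except Exception:
            # A rule that crashes on a sample input is treated as invalid.
            return False
    return True


# Hypothetical toy rule: rotate a list one position to the left.
def rotate_left(xs: list) -> list:
    return xs[1:] + xs[:1]


# Correct inverse: rotate one position to the right.
def rotate_right(xs: list) -> list:
    return xs[-1:] + xs[:-1]


# A plausible-looking but wrong inverse, such as a model might hallucinate.
def bad_inverse(xs: list) -> list:
    return list(reversed(xs))


samples = [[1, 2, 3], [4, 5, 6, 7], []]
valid = is_cycle_consistent(rotate_left, rotate_right, samples)
invalid = is_cycle_consistent(rotate_left, bad_inverse, samples)
```

Under these assumptions, a rule pair is kept only when the round trip restores every sample input, so a wrong inverse is rejected without any human review.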

Key facts

A2RBench is an automated pipeline for generating abstract reasoning benchmarks
It includes generation, expansion, evaluation, and analysis stages
LLMs create diverse tasks requiring genuine reasoning
Expansion reuses validated rules and generates new input spaces
Programmatic verification uses cycle consistency to eliminate hallucinations
Cycle consistency tests whether an inverse operation reverses the forward operation
Addresses limitations of manual annotation and memorization risks
Published on arXiv with ID 2605.17278
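The expansion step listed above could look roughly like the following Python sketch, which reuses one already-validated rule on freshly sampled inputs. The source does not describe the actual mechanism, so every name here (`expand_task`, `caesar_shift`, `caesar_unshift`, `random_word`) is a hypothetical stand-in, and the Caesar shift is just a toy rule.

```python
import random
import string
from typing import Callable, List, Tuple


def caesar_shift(s: str) -> str:
    """Toy forward rule (assumption): shift lowercase letters by 3."""
    return "".join(chr((ord(c) - 97 + 3) % 26 + 97) for c in s)


def caesar_unshift(s: str) -> str:
    """Inverse of the toy rule: shift lowercase letters back by 3."""
    return "".join(chr((ord(c) - 97 - 3) % 26 + 97) for c in s)


def random_word(rng: random.Random) -> str:
    """Sample a new input: a lowercase word of 3 to 8 letters."""
    length = rng.randint(3, 8)
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))


def expand_task(
    forward: Callable[[str], str],
    inverse: Callable[[str], str],
    make_input: Callable[[random.Random], str],
    n: int,
    seed: int = 0,
) -> List[Tuple[str, str]]:
    """Build n candidate (input, output) pairs from one validated rule.

    Each pair is kept only if the inverse recovers the input, so the
    expanded examples stay programmatically verified.
    """
    rng = random.Random(seed)
    examples = []
    for _ in range(n):
        x = make_input(rng)
        y = forward(x)
        if inverse(y) == x:
            examples.append((x, y))
    return examples


examples = expand_task(caesar_shift, caesar_unshift, random_word, n=5)
```

Seeding the generator keeps the expanded set reproducible, which matters if benchmark results are to be compared across runs.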
