A2RBench: Automated Benchmark for LLM Abstract Reasoning
A2RBench is a new automated pipeline that generates formally verifiable benchmarks for testing abstract reasoning in large language models (LLMs). The pipeline uses LLMs to create diverse reasoning tasks, then expands them by reusing validated rules and generating new input spaces. To eliminate hallucinations, it verifies tasks programmatically through cycle consistency: a check that an inverse operation reverses a forward operation. The aim is to measure genuine reasoning rather than memorization, without the expensive manual annotation that existing benchmarks rely on. The arXiv paper (2605.17278) details four stages: generation, expansion, evaluation, and analysis.
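The cycle-consistency check can be illustrated with a minimal Python sketch. This is not the paper's implementation: the helper `is_cycle_consistent` and the rotation rules are hypothetical names chosen for illustration. The assumption is that a task is represented as a pair of functions and is accepted only if applying the inverse to the forward output returns every sampled input.

```python
from typing import Callable, Iterable, List


def is_cycle_consistent(
    forward: Callable[[List[int]], List[int]],
    inverse: Callable[[List[int]], List[int]],
    inputs: Iterable[List[int]],
) -> bool:
    """Return True only if inverse(forward(x)) == x for every sampled input.

    Hypothetical helper: a rule pair that raises an error or fails the
    round trip on any input is rejected.
    """
    for x in inputs:
        try:
            if inverse(forward(x)) != x:
                return False
        except Exception:
            return False
    return True


# Hypothetical rule pair, standing in for LLM-generated code.
def rotate_left(xs: List[int]) -> List[int]:
    return xs[1:] + xs[:1]


def rotate_right(xs: List[int]) -> List[int]:
    return xs[-1:] + xs[:-1]


# A plausible-looking but wrong inverse, as a hallucinated rule might be.
def bad_inverse(xs: List[int]) -> List[int]:
    return xs[::-1]


samples = [[1, 2, 3], [4, 5, 6, 7], [9]]
ok = is_cycle_consistent(rotate_left, rotate_right, samples)
rejected = is_cycle_consistent(rotate_left, bad_inverse, samples)
print(ok, rejected)
```

In this sketch the correct pair passes and the wrong inverse is rejected on the first sample, since reversing `[2, 3, 1]` does not give back `[1, 2, 3]`. A check of this kind only shows consistency on the sampled inputs, so the choice of samples matters.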
Key facts
- A2RBench is an automated pipeline for generating abstract reasoning benchmarks
- It includes generation, expansion, evaluation, and analysis stages
- LLMs create diverse tasks requiring genuine reasoning
- Expansion reuses validated rules and generates new input spaces
- Programmatic verification uses cycle consistency to eliminate hallucinations
- Cycle consistency tests whether the inverse operation reverses the forward operation
- Addresses limitations of existing benchmarks: expensive manual annotation and the risk of testing memorization
- Published on arXiv with ID 2605.17278
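The expansion step listed above (reusing a validated rule over new input spaces) might look roughly like the following Python sketch. Everything here is an assumption made for illustration, not the paper's code: the Caesar-shift rule, the `expand_task` helper, and the two samplers are hypothetical. The idea shown is that one already-validated rule pair is applied to inputs drawn from different samplers, and each new example is kept only if it survives the round trip.

```python
import random
import string
from typing import Callable, List, Tuple


# Hypothetical validated rule pair: shift lowercase letters by k positions.
def caesar_shift(text: str, k: int = 3) -> str:
    return "".join(chr((ord(c) - 97 + k) % 26 + 97) for c in text)


def caesar_unshift(text: str, k: int = 3) -> str:
    return "".join(chr((ord(c) - 97 - k) % 26 + 97) for c in text)


def make_sampler(length: int) -> Callable[[random.Random], str]:
    """Build a sampler for one input space: lowercase strings of a fixed length."""
    def sample(rng: random.Random) -> str:
        return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))
    return sample


def expand_task(
    forward: Callable[[str], str],
    inverse: Callable[[str], str],
    sampler: Callable[[random.Random], str],
    n: int,
    seed: int = 0,
) -> List[Tuple[str, str]]:
    """Generate n candidate (input, output) pairs; keep those passing the round trip."""
    rng = random.Random(seed)
    examples = []
    for _ in range(n):
        x = sampler(rng)
        y = forward(x)
        if inverse(y) == x:  # re-verify each new example
            examples.append((x, y))
    return examples


# Same rule, two different input spaces.
short_examples = expand_task(caesar_shift, caesar_unshift, make_sampler(4), n=5)
long_examples = expand_task(caesar_shift, caesar_unshift, make_sampler(12), n=5)
print(short_examples[0], long_examples[0])
```

Because the rule is fixed and only the sampler changes, the benchmark can grow without asking an LLM to write new rules, and every added example is still checked programmatically.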
Entities
Institutions
- arXiv