Symbolic Inputs Boost LLM Performance on Abstract Visual Reasoning
A recent study published on arXiv (2604.21346) asks whether vision-language models (VLMs) struggle with abstract visual reasoning because of weak reasoning or weak representations. Using the Bongard-LOGO benchmark, the researchers compared end-to-end VLMs given raw images against large language models (LLMs) given symbolic inputs derived from the same images. The study introduces the Componential-Grammatical (C-G) approach, which recasts Bongard-LOGO as a symbolic reasoning task by representing each shape as a LOGO-style action program or a structured description. LLMs reached mid-90s accuracy on Free-form problems, while a strong visual baseline remained near chance under matched task definitions. Ablations over input formats, explicit concept prompts, and minimal visual cues point to representation, not reasoning, as the bottleneck.
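To make the symbolic reformulation concrete, the sketch below shows one plausible way a LOGO-style action program could be rendered as a textual few-shot prompt for an LLM. The action vocabulary, tuple format, and prompt wording are illustrative assumptions, not the paper's actual encoding.

```python
# Hypothetical sketch of the C-G idea: turn LOGO-style action programs into
# a symbolic text prompt for an LLM. The (action, length, angle) tuple format
# and the prompt phrasing are assumptions for illustration only.

def program_to_description(program):
    """Render a list of (action, length, angle) tuples as a readable description."""
    steps = [
        f"{action} stroke of length {length}, then turn {angle} degrees"
        for action, length, angle in program
    ]
    return "; ".join(steps)

def build_prompt(positive, negative, query):
    """Assemble a few-shot symbolic classification prompt from example programs."""
    lines = ["Each shape is described by its drawing program."]
    lines.append("Positive examples:")
    for p in positive:
        lines.append("- " + program_to_description(p))
    lines.append("Negative examples:")
    for n in negative:
        lines.append("- " + program_to_description(n))
    lines.append("Query: " + program_to_description(query))
    lines.append("Does the query belong to the positive set? Answer yes or no.")
    return "\n".join(lines)

# Toy example: positive shapes are two straight strokes, negatives are arcs.
demo_pos = [[("line", 2, 90), ("line", 2, 90)]]
demo_neg = [[("arc", 1, 45)]]
query = [("line", 2, 90), ("line", 2, 90)]
print(build_prompt(demo_pos, demo_neg, query))
```

The point of such an encoding is that the LLM never sees pixels: the perceptual problem is solved upstream, isolating the reasoning step the study wanted to measure.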
Key facts
- Study compares VLMs on raw images with LLMs on symbolic inputs
- Uses Bongard-LOGO synthetic benchmark for abstract concept learning
- C-G paradigm reformulates benchmark as symbolic reasoning task
- LLMs reach mid-90s accuracy on Free-form problems
- Visual baseline remains near chance under matched task definitions
- Ablations test input format, concept prompts, and visual cues
- Published on arXiv with ID 2604.21346
- Findings point to representational bottlenecks in VLMs rather than reasoning failures
Entities
Institutions
- arXiv