AInstein Framework Tests LLMs on AI Research Problems
AInstein is a framework that evaluates whether large language models can solve AI research problems using only their parametric knowledge, without fine-tuning, retrieval, or other external resources. The study, released on arXiv, begins with a blind validation in which 20 domain experts assess solutions to held-out ICLR 2026 problems, then scales to 1,214 ICLR 2025 papers using an LLM-as-a-judge paradigm. Performance is measured on two axes: Success Rate (does the proposed solution resolve the problem?) and Rediscovery (does it match the published method?). LLMs succeed on over 70% of problems yet strictly rediscover the published solution less than 19% of the time, suggesting genuine problem-solving rather than mere recall of known results. They still fail, however, on problems that require novel combinations of ideas or external knowledge.
Key facts
- AInstein framework tests LLMs on AI research problems using only parametric knowledge.
- Blind study with 20 domain experts on held-out ICLR 2026 problems.
- Scaled to 1,214 ICLR 2025 papers using LLM-as-a-judge paradigm.
- Two metrics: Success Rate and Rediscovery.
- LLMs succeed on over 70% of problems.
- Strict rediscovery rate less than 19%.
- Models fail on problems requiring novel combinations or external knowledge.
- Published on arXiv with ID 2510.05432.
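The two metrics above are simple proportions over judged outcomes. A minimal sketch of how they could be computed; the `Judgement` record and the toy data are hypothetical illustrations, not the paper's actual schema:

```python
from dataclasses import dataclass


@dataclass
class Judgement:
    solved: bool        # judge deems the proposed solution resolves the problem
    rediscovered: bool  # judge deems it matches the published method

def success_rate(judgements: list[Judgement]) -> float:
    # fraction of problems the model's solution resolves
    return sum(j.solved for j in judgements) / len(judgements)

def rediscovery_rate(judgements: list[Judgement]) -> float:
    # fraction of problems where the solution matches the published method
    return sum(j.rediscovered for j in judgements) / len(judgements)

# toy data showing the headline pattern: high success, low rediscovery
toy = (
    [Judgement(solved=True, rediscovered=False)] * 7
    + [Judgement(solved=True, rediscovered=True)] * 1
    + [Judgement(solved=False, rediscovered=False)] * 2
)
print(success_rate(toy))      # 0.8
print(rediscovery_rate(toy))  # 0.1
```

Note that rediscovery implies success only by convention here; the gap between the two rates is what the paper uses to separate problem-solving ability from replication of known solutions.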
Entities
Institutions
- arXiv
- ICLR