ARTFEED — Contemporary Art Intelligence

AInstein Framework Tests LLMs on AI Research Problems

ai-technology · 2026-04-30

A new framework, AInstein, assesses whether large language models can tackle AI research problems using only their parametric knowledge, with no fine-tuning, retrieval, or external tools. The study, released on arXiv, begins with a blind validation in which 20 domain experts assess held-out ICLR 2026 problems, then scales to 1,214 ICLR 2025 papers via an LLM-as-a-judge paradigm. Performance is measured on two metrics: Success Rate (does the proposed solution resolve the problem?) and Rediscovery (does it align with the published method?). LLMs succeed on over 70% of problems yet strictly rediscover the published solution less than 19% of the time, suggesting genuine problem-solving rather than mere recall of known answers. They still fail, however, on problems requiring novel combinations of techniques or external knowledge.

Key facts

  • AInstein framework tests LLMs on AI research problems using only parametric knowledge.
  • Blind study with 20 domain experts on held-out ICLR 2026 problems.
  • Scaled to 1,214 ICLR 2025 papers using LLM-as-a-judge paradigm.
  • Two metrics: Success Rate and Rediscovery.
  • LLMs succeed on over 70% of problems.
  • Strict rediscovery rate less than 19%.
  • Models fail on problems requiring novel combinations or external knowledge.
  • Published on arXiv with ID 2510.05432.
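The two metrics above can be sketched as simple aggregates over per-paper judge verdicts. The names and data shapes below are illustrative assumptions, not taken from the paper:

```python
# Hypothetical sketch of AInstein-style metric aggregation from judge
# verdicts; class and field names are invented for illustration.
from dataclasses import dataclass

@dataclass
class Verdict:
    solves_problem: bool      # judge: does the proposal resolve the issue?
    matches_published: bool   # judge: does it align with the published method?

def aggregate(verdicts: list[Verdict]) -> tuple[float, float]:
    """Return (success_rate, rediscovery_rate) as fractions of all verdicts."""
    n = len(verdicts)
    success = sum(v.solves_problem for v in verdicts) / n
    rediscovery = sum(v.matches_published for v in verdicts) / n
    return success, rediscovery

# Toy example: 10 papers, 7 solved, 1 of which also matches the published method.
verdicts = ([Verdict(True, True)]
            + [Verdict(True, False)] * 6
            + [Verdict(False, False)] * 3)
success, rediscovery = aggregate(verdicts)
print(f"Success Rate: {success:.0%}, Rediscovery: {rediscovery:.0%}")
# → Success Rate: 70%, Rediscovery: 10%
```

The gap between the two numbers is the study's core signal: a model can genuinely resolve a problem while arriving at a different solution than the one the authors published.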

Entities

Institutions

  • arXiv
  • ICLR

Sources