ARTFEED — Contemporary Art Intelligence

New Benchmark Tests LLM Reasoning via Black-Box Interaction

ai-technology · 2026-05-07

Researchers introduced the Oracle benchmark to evaluate large language models' reasoning through black-box environment interaction: each environment hides a function, and models must infer it by exploring input-output pairs. The benchmark comprises 96 environments across 6 task types. Among the 19 LLMs tested, OpenAI's o3 ranked first on 5 of the 6 task types.

Key facts

  • Oracle benchmark comprises 6 types of black-box tasks with 96 environments
  • 19 modern LLMs were benchmarked
  • OpenAI's o3 ranked first in 5 of 6 tasks
  • Each black-box environment is defined by a hidden function mapping inputs to outputs
  • LLMs must infer the hidden function through interaction and reasoning
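
The interaction loop described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual code: the `BlackBoxEnv` class, the affine hidden rule, and the two-point fitting strategy are all hypothetical assumptions chosen to show how a solver explores input-output pairs and proposes a hypothesis.

```python
class BlackBoxEnv:
    """Hypothetical black-box environment: a hidden function maps
    inputs to outputs, and the solver may only query it."""

    def __init__(self, hidden_fn):
        self._hidden_fn = hidden_fn  # never revealed to the solver
        self.queries = 0

    def query(self, x):
        self.queries += 1
        return self._hidden_fn(x)


# Hidden rule (unknown to the solver): f(x) = 2*x + 3.
env = BlackBoxEnv(lambda x: 2 * x + 3)

# Explore: collect input-output pairs by querying the black box.
pairs = [(x, env.query(x)) for x in range(5)]

# Reason: fit a line through two observed points
# (this toy solver assumes the hidden rule is affine).
(x0, y0), (x1, y1) = pairs[0], pairs[1]
slope = (y1 - y0) // (x1 - x0)
intercept = y0 - slope * x0

# Verify the hypothesis against fresh, held-out queries.
assert all(slope * x + intercept == env.query(x) for x in range(10, 15))
print(slope, intercept, env.queries)  # 2 3 10
```

An LLM in the benchmark plays the role of this solver, but must choose its queries and form its hypothesis through reasoning rather than a fixed fitting formula.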

Entities

Institutions

  • OpenAI
  • arXiv

Sources