New Benchmark Tests LLM Reasoning via Black-Box Interaction
Researchers introduced the Oracle benchmark to evaluate large language models' reasoning via interaction with black-box environments, in which a model must infer a hidden function from the input-output pairs it elicits. The benchmark comprises 96 environments spanning 6 task types. Among the 19 LLMs tested, OpenAI's o3 ranked first on 5 of the 6 task types.
Key facts
- Oracle benchmark comprises 6 types of black-box tasks with 96 environments
- 19 modern LLMs were benchmarked
- OpenAI's o3 ranked first in 5 of 6 tasks
- Each black-box environment is defined by a hidden function mapping inputs to outputs
- LLMs must uncover the hidden function through interaction and reasoning (see the sketch after this list)
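To make the interaction pattern concrete, here is a minimal Python sketch of a black-box environment of the kind the benchmark describes: a hidden function is observable only through a bounded query interface, and the agent gathers input-output pairs to infer the rule. Class and function names are illustrative assumptions, not the benchmark's actual API.

```python
class BlackBoxEnvironment:
    """Minimal black-box environment: a hidden function maps inputs to
    outputs, and the agent may observe it only through queries."""

    def __init__(self, hidden_fn, query_budget=20):
        self._hidden_fn = hidden_fn      # never exposed to the agent
        self.query_budget = query_budget
        self.queries_used = 0

    def query(self, x):
        """Return the hidden function's output for input x,
        consuming one unit of the query budget."""
        if self.queries_used >= self.query_budget:
            raise RuntimeError("query budget exhausted")
        self.queries_used += 1
        return self._hidden_fn(x)


def explore(env, inputs):
    """Collect input-output pairs by interacting with the environment."""
    return [(x, env.query(x)) for x in inputs]


if __name__ == "__main__":
    # Hypothetical hidden rule the agent must infer from observations.
    env = BlackBoxEnvironment(hidden_fn=lambda x: 3 * x + 1)
    observations = explore(env, inputs=range(5))
    print(observations)  # [(0, 1), (1, 4), (2, 7), (3, 10), (4, 13)]
```

In this setup the evaluated model would see only `observations`, never the function body, and would be scored on whether its reasoning recovers the hidden rule within the query budget.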
Entities
Institutions
- OpenAI
- arXiv