New Benchmark Tests LLM Reasoning via Black-Box Interaction
Researchers introduced the Oracle benchmark to evaluate large language models' reasoning via interaction with black-box environments, in which a model must infer a hidden function from the input-output pairs it elicits. The benchmark comprises 96 environments spanning 6 task types. Among the 19 LLMs tested, OpenAI's o3 ranked first on 5 of the 6 task types.
Key facts
- Oracle benchmark comprises 6 types of black-box tasks with 96 environments
- 19 modern LLMs were benchmarked
- OpenAI's o3 ranked first in 5 of 6 tasks
- Each black-box environment is defined by a hidden function mapping inputs to outputs
- LLMs must uncover the hidden function through interaction and reasoning (see the sketch after this list)
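To make the interaction pattern concrete, here is a minimal Python sketch of a black-box environment of the kind the benchmark describes: a hidden function is observable only through a bounded query interface, and the agent gathers input-output pairs to infer the rule. Class and function names are illustrative assumptions, not the benchmark's actual API.

```python
class BlackBoxEnvironment:
    """Minimal black-box environment: a hidden function maps inputs to
    outputs, and the agent may observe it only through queries."""

    def __init__(self, hidden_fn, query_budget=20):
        self._hidden_fn = hidden_fn      # never exposed to the agent
        self.query_budget = query_budget
        self.queries_used = 0

    def query(self, x):
        """Return the hidden function's output for input x,
        consuming one unit of the query budget."""
        if self.queries_used >= self.query_budget:
            raise RuntimeError("query budget exhausted")
        self.queries_used += 1
        return self._hidden_fn(x)


def explore(env, inputs):
    """Collect input-output pairs by interacting with the environment."""
    return [(x, env.query(x)) for x in inputs]


if __name__ == "__main__":
    # Hypothetical hidden rule the agent must infer from observations.
    env = BlackBoxEnvironment(hidden_fn=lambda x: 3 * x + 1)
    observations = explore(env, inputs=range(5))
    print(observations)  # [(0, 1), (1, 4), (2, 7), (3, 10), (4, 13)]
```

In this setup the evaluated model would see only `observations`, never the function body, and would be scored on whether its reasoning recovers the hidden rule within the query budget.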
Entities
Institutions
- OpenAI
- arXiv