Mini-BEHAVIOR-Gran Benchmark Reveals U-Shaped Performance Curve in AI Instruction Granularity
Researchers have introduced a new benchmark, Mini-BEHAVIOR-Gran, to study how instruction granularity affects language-conditioned embodied AI agents. The benchmark extends the Mini-BEHAVIOR framework with multiple instruction variants per task, ranging from high-level goal descriptions to detailed step-by-step guidance. The team compared four metrics for quantifying granularity across tasks: token count, action-verb count, entity count, and planning-width. Of these, planning-width correlated most consistently with agent performance. Structuring training and evaluation around planning-width revealed a U-shaped relationship between instruction granularity and performance, with peaks at both the fine-grained and coarse-grained ends. The study, available as arXiv preprint 2604.17019v1, addresses a significant gap in embodied AI evaluation: most existing benchmarks use a single static instruction per task.
Key facts
- Mini-BEHAVIOR-Gran is a new benchmark for studying instruction granularity in embodied AI
- The benchmark extends Mini-BEHAVIOR with multiple instruction variants per task
- Instruction variants range from high-level goal descriptions to step-by-step guidance
- Four metrics were compared for cross-task granularity quantification
- Planning-width correlated most consistently with agent performance
- A U-shaped relationship exists between instruction granularity and performance
- Performance peaks at both fine-grained and coarse-grained instruction levels
- Existing benchmarks typically use single static instructions per task
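Three of the four granularity metrics above are surface statistics that are straightforward to compute. The sketch below is illustrative only: the verb list and the semicolon-delimited step heuristic (used here as a crude stand-in for planning-width) are assumptions for demonstration, not the paper's actual definitions.

```python
# Hedged sketch of instruction-granularity metrics.
# ACTION_VERBS and the ";"-based step heuristic are assumptions,
# not the metric definitions from the Mini-BEHAVIOR-Gran paper.

ACTION_VERBS = {"pick", "place", "open", "close", "move", "put", "go", "grab"}

def granularity_metrics(instruction: str) -> dict:
    """Compute rough granularity metrics for one instruction string."""
    cleaned = instruction.lower()
    for ch in ",.;":
        cleaned = cleaned.replace(ch, " ")
    tokens = cleaned.split()
    # Treat semicolon-separated clauses as sub-steps (illustrative proxy
    # for planning-width, which the paper defines more precisely).
    steps = [s for s in instruction.split(";") if s.strip()]
    return {
        "token_count": len(tokens),
        "action_verb_count": sum(t in ACTION_VERBS for t in tokens),
        "step_count": len(steps),
    }

coarse = "Clean the table."
fine = ("Go to the table; pick up the cup; open the cabinet; "
        "place the cup inside; close the cabinet.")
print(granularity_metrics(coarse))  # few tokens, no listed verbs, one step
print(granularity_metrics(fine))    # many tokens, several verbs and steps
```

A coarse instruction scores low on all three counts, while a fine-grained one scores high, which is exactly the spread across instruction variants the benchmark is designed to probe.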