ARTFEED — Contemporary Art Intelligence

SafetyALFRED Benchmark Reveals AI Safety Planning Deficits in Kitchen Environments

ai-technology · 2026-04-22

A new research benchmark called SafetyALFRED evaluates multimodal large language models' ability to address safety hazards in interactive environments. Built upon the existing ALFRED embodied-agent benchmark, it incorporates six categories of real-world kitchen dangers. The study tested eleven state-of-the-art models from the Qwen, Gemma, and Gemini families, examining both hazard recognition and active risk mitigation through embodied planning. Results reveal a significant alignment gap: models recognize hazards reliably in question-answering settings, but their success rates at actually mitigating those risks through embodied planning remain comparatively low. The authors argue that static QA evaluations are therefore insufficient for assessing physical safety capabilities, and advocate a shift toward more comprehensive safety assessments of AI systems operating in physical spaces. The paper was published on arXiv with identifier 2604.19638v1.

Key facts

  • SafetyALFRED is a new benchmark for evaluating AI safety planning
  • Built upon the ALFRED embodied agent benchmark
  • Incorporates six categories of real-world kitchen hazards
  • Tests eleven state-of-the-art models from Qwen, Gemma, and Gemini families
  • Evaluates both hazard recognition and active risk mitigation
  • Reveals significant gap between recognition and mitigation capabilities
  • Static QA evaluations insufficient for physical safety assessment
  • Advocates for paradigm shift in AI safety evaluation
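The gap the benchmark highlights can be made concrete as a simple metric: the per-category difference between a model's hazard-recognition accuracy (QA) and its risk-mitigation success rate (embodied planning), averaged across categories. The sketch below is illustrative only; the category names and scores are invented placeholders, not the paper's actual taxonomy or results.

```python
# Hedged sketch of a recognition-vs-mitigation "alignment gap" metric.
# Category names and all scores are hypothetical, not from the paper.

HAZARD_CATEGORIES = [  # six kitchen-hazard categories (names assumed)
    "fire", "electrical", "sharp_objects",
    "spills", "toxic_substances", "hot_surfaces",
]

def alignment_gap(recognition: dict, mitigation: dict) -> float:
    """Mean per-category difference between QA hazard-recognition accuracy
    and embodied risk-mitigation success rate (both on a 0-1 scale)."""
    diffs = [recognition[c] - mitigation[c] for c in HAZARD_CATEGORIES]
    return sum(diffs) / len(diffs)

# Illustrative scores for one model (invented for this sketch):
# strong recognition, weak mitigation -> large positive gap.
recognition = {c: 0.9 for c in HAZARD_CATEGORIES}
mitigation = {c: 0.3 for c in HAZARD_CATEGORIES}
print(f"alignment gap: {alignment_gap(recognition, mitigation):.2f}")
```

A gap near zero would mean a model acts on the hazards it can name; the benchmark's finding is that for current models this number is substantially positive.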

Entities

Institutions

  • arXiv

Sources