ARTFEED — Contemporary Art Intelligence

SafetyALFRED Benchmark Reveals AI Safety Planning Deficits in Kitchen Environments

ai-technology · 2026-04-22

A new research benchmark called SafetyALFRED evaluates multimodal large language models' ability to address safety hazards in interactive environments. Built upon the existing ALFRED embodied-agent benchmark, it incorporates six categories of real-world kitchen dangers. The study tested eleven state-of-the-art models from the Qwen, Gemma, and Gemini families, examining both hazard recognition and active risk mitigation through embodied planning. Results reveal a significant alignment gap: models recognize hazards reliably in question-answering settings, but their success rates at actually mitigating those risks through embodied planning remain comparatively low. The authors argue that static QA evaluations are therefore insufficient for assessing physical safety capabilities, and advocate a shift toward more comprehensive safety assessments of AI systems operating in physical spaces. The paper was published on arXiv with identifier 2604.19638v1.

Key facts

  • SafetyALFRED is a new benchmark for evaluating AI safety planning
  • Built upon the ALFRED embodied agent benchmark
  • Incorporates six categories of real-world kitchen hazards
  • Tests eleven state-of-the-art models from Qwen, Gemma, and Gemini families
  • Evaluates both hazard recognition and active risk mitigation
  • Reveals significant gap between recognition and mitigation capabilities
  • Static QA evaluations insufficient for physical safety assessment
  • Advocates for paradigm shift in AI safety evaluation
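The gap the benchmark highlights can be made concrete as a simple metric: the per-category difference between a model's hazard-recognition accuracy (QA) and its risk-mitigation success rate (embodied planning), averaged across categories. The sketch below is illustrative only; the category names and scores are invented placeholders, not the paper's actual taxonomy or results.

```python
# Hedged sketch of a recognition-vs-mitigation "alignment gap" metric.
# Category names and all scores are hypothetical, not from the paper.

HAZARD_CATEGORIES = [  # six kitchen-hazard categories (names assumed)
    "fire", "electrical", "sharp_objects",
    "spills", "toxic_substances", "hot_surfaces",
]

def alignment_gap(recognition: dict, mitigation: dict) -> float:
    """Mean per-category difference between QA hazard-recognition accuracy
    and embodied risk-mitigation success rate (both on a 0-1 scale)."""
    diffs = [recognition[c] - mitigation[c] for c in HAZARD_CATEGORIES]
    return sum(diffs) / len(diffs)

# Illustrative scores for one model (invented for this sketch):
# strong recognition, weak mitigation -> large positive gap.
recognition = {c: 0.9 for c in HAZARD_CATEGORIES}
mitigation = {c: 0.3 for c in HAZARD_CATEGORIES}
print(f"alignment gap: {alignment_gap(recognition, mitigation):.2f}")
```

A gap near zero would mean a model acts on the hazards it can name; the benchmark's finding is that for current models this number is substantially positive.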

Entities

Institutions

  • arXiv

Sources