FoodMonitor: Benchmarking MLLMs for Explainable Compliance Analysis in Kitchens

other · 2026-05-26

Researchers have introduced FoodMonitor, a benchmark for evaluating multimodal large language models (MLLMs) on explainable compliance analysis in commercial kitchen surveillance. The dataset contains 477 video clips annotated with 3,307 violations and uses a dual-channel framework that covers both person-level and environment-level violations. Each annotation records the rule that was violated, the non-compliant behavior, and who committed it, localized with frame-level bounding boxes. A unified evaluation protocol uses a two-stage matching mechanism to assess spatial localization and semantic understanding separately. The benchmark addresses a gap in existing video anomaly detection datasets, which focus mainly on event-level binary classification, and aims to support AI-driven compliance monitoring in public governance and industrial safety by providing verifiable evidence and traceable accountability signals.
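
To make the annotation contents concrete, here is a minimal sketch of how one violation record might be represented. The class name, field names, box format, and all example values are hypothetical, chosen only for illustration; the summary does not describe the dataset's actual file schema.

```python
from dataclasses import dataclass, field
from typing import Dict, Literal, Optional, Tuple

# Assumed box format: (x1, y1, x2, y2) in pixels. The real format is not stated.
Box = Tuple[float, float, float, float]


@dataclass
class ViolationAnnotation:
    """Hypothetical record for one annotated violation in one clip."""
    clip_id: str
    channel: Literal["person", "environment"]  # the two channels named in the summary
    rule: str                                  # which rule was violated
    behavior: str                              # description of the non-compliant behavior
    subject_id: Optional[str] = None           # who committed it, when a person is involved
    boxes: Dict[int, Box] = field(default_factory=dict)  # frame index -> bounding box


# Made-up example values, not taken from the dataset.
example = ViolationAnnotation(
    clip_id="clip_0001",
    channel="person",
    rule="Staff must wear a hair covering",
    behavior="Worker handles food without a hair covering",
    subject_id="person_2",
    boxes={120: (310.0, 80.0, 420.0, 360.0), 121: (312.0, 81.0, 423.0, 361.0)},
)
```

Keeping the boxes keyed by frame index is one way to express "frame-level" localization; a list of per-frame entries would work equally well.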

Key facts

FoodMonitor is a benchmark for explainable compliance analysis in commercial kitchen surveillance.
It comprises 477 video clips with 3,307 violation annotations.
The dataset covers person-level and environment-level violations.
Each annotation includes the violated rule, the non-compliant behavior, and who committed it, localized with frame-level bounding boxes.
The benchmark establishes a unified evaluation protocol with a two-stage matching mechanism.
The two stages separately assess spatial localization and semantic understanding.
Existing video anomaly detection datasets focus on event-level binary classification.
The benchmark aims to provide verifiable evidence and traceable accountability signals.
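
The two-stage evaluation described in the facts above can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's protocol: it assumes stage one pairs predictions with ground truth by greedy IoU matching at a 0.5 threshold, and stage two checks semantics only on matched pairs by exact rule-label equality. The function names, dictionary keys, and metric names are all hypothetical.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # assumed (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_spatial(preds: List[dict], gts: List[dict], thr: float = 0.5) -> List[Tuple[int, int]]:
    """Stage 1 (assumed): greedy one-to-one matching by descending IoU."""
    candidates = sorted(
        ((iou(p["box"], g["box"]), i, j)
         for i, p in enumerate(preds) for j, g in enumerate(gts)),
        reverse=True,
    )
    used_pred, used_gt, pairs = set(), set(), []
    for score, i, j in candidates:
        if score < thr:
            break  # remaining candidates overlap even less
        if i in used_pred or j in used_gt:
            continue
        used_pred.add(i)
        used_gt.add(j)
        pairs.append((i, j))
    return pairs


def evaluate(preds: List[dict], gts: List[dict], thr: float = 0.5) -> Dict[str, float]:
    """Stage 2 (assumed): compare rule labels on spatially matched pairs only."""
    pairs = match_spatial(preds, gts, thr)
    semantic_hits = sum(1 for i, j in pairs if preds[i]["rule"] == gts[j]["rule"])
    return {
        "localization_recall": len(pairs) / len(gts) if gts else 0.0,
        "semantic_accuracy": semantic_hits / len(pairs) if pairs else 0.0,
    }
```

Separating the stages this way lets a model that localizes a violation correctly but names the wrong rule be scored differently from one that misses the violation entirely. A real protocol would likely compare free-text explanations with something softer than exact label equality.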

Sources

arXiv cs.AI — 2026-05-26