FoodMonitor: Benchmarking MLLMs for Explainable Compliance Analysis in Kitchens
Researchers have introduced FoodMonitor, a benchmark for evaluating multimodal large language models (MLLMs) on explainable compliance analysis in commercial kitchen surveillance. The dataset contains 477 video clips annotated with 3,307 violations under a dual-channel framework that covers both person-level and environment-level infractions. Each annotation records the rule breached, the non-compliant behavior, and who committed it, with frame-level bounding boxes. A unified evaluation protocol uses a two-stage matching mechanism to assess spatial localization and semantic understanding separately. The benchmark addresses a gap in existing video anomaly detection datasets, which focus mainly on event-level binary classification, and aims to support AI-driven compliance monitoring in public governance and industrial safety by providing verifiable evidence and traceable accountability signals.
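To make the annotation contents concrete, the record described above could be modeled roughly as follows. This is a minimal sketch, not the dataset's actual file format: the class name, every field name, the box layout, and the example values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

# Assumed box layout: (x1, y1, x2, y2) in pixels. The real format may differ.
Box = Tuple[float, float, float, float]


@dataclass
class ViolationAnnotation:
    """Hypothetical record for one annotated violation in one clip."""
    clip_id: str
    channel: str   # "person" or "environment" (the two channels in the summary)
    rule: str      # the rule that was violated
    behavior: str  # description of the non-compliant behavior
    # Frame index -> bounding box of the responsible subject in that frame.
    boxes: Dict[int, Box] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Reject anything outside the two channels named in the summary.
        if self.channel not in ("person", "environment"):
            raise ValueError(f"unknown channel: {self.channel!r}")


# Illustrative values only; none of these come from the dataset.
example = ViolationAnnotation(
    clip_id="clip_0001",
    channel="person",
    rule="hypothetical rule: head covering required",
    behavior="staff member handling food without a head covering",
    boxes={12: (40.0, 30.0, 120.0, 200.0), 13: (42.0, 31.0, 122.0, 201.0)},
)
```

Keeping the boxes keyed by frame index reflects the frame-level localization mentioned in the summary; how the dataset actually stores tracks is not stated in the source.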
Key facts
- FoodMonitor is a benchmark for explainable compliance analysis in commercial kitchen surveillance.
- It comprises 477 video clips with 3,307 violation annotations.
- The dataset covers person-level and environment-level violations.
- Each annotation includes the violated rule, the non-compliant behavior, and who committed it, localized with frame-level bounding boxes.
- A unified evaluation protocol with a two-stage matching mechanism is established.
- The two stages separately assess spatial localization and semantic understanding.
- Existing video anomaly detection datasets focus on event-level binary classification.
- The benchmark aims to provide verifiable evidence and traceable accountability signals.
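The two-stage matching idea in the facts above can be sketched in a few lines. The source does not give the matching criteria, so everything specific here is an assumption: stage one uses greedy one-to-one matching on box IoU with a threshold of 0.5, stage two uses exact equality of rule labels as a stand-in for semantic comparison, and the function names and rule labels are hypothetical. The sketch also works on single boxes rather than per-frame tracks.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # assumed (x1, y1, x2, y2)
Item = Tuple[Box, str]                   # (box, rule label)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def two_stage_match(preds: List[Item], gts: List[Item],
                    iou_thresh: float = 0.5) -> Dict[str, int]:
    # Stage 1 (spatial): greedy one-to-one pairing, highest IoU first.
    candidates = sorted(
        ((iou(p[0], g[0]), i, j)
         for i, p in enumerate(preds) for j, g in enumerate(gts)),
        reverse=True,
    )
    used_p, used_g, pairs = set(), set(), []
    for score, i, j in candidates:
        if score < iou_thresh:
            break  # remaining candidates overlap even less
        if i in used_p or j in used_g:
            continue
        used_p.add(i)
        used_g.add(j)
        pairs.append((i, j))

    # Stage 2 (semantic): only spatially matched pairs are compared.
    # Exact label equality is a placeholder for a real semantic check.
    semantic = sum(1 for i, j in pairs if preds[i][1] == gts[j][1])
    return {"spatial_matches": len(pairs), "semantic_matches": semantic}


# Hypothetical labels and boxes for illustration.
gts = [((0, 0, 10, 10), "no_hat"), ((20, 20, 30, 30), "no_gloves")]
preds = [((1, 1, 10, 10), "no_hat"),
         ((20, 20, 30, 30), "no_mask"),
         ((50, 50, 60, 60), "no_hat")]
result = two_stage_match(preds, gts)
```

Separating the stages means a prediction that finds the right place but names the wrong rule is still counted as a spatial match, which makes localization errors and comprehension errors distinguishable in the final scores.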