FoodCHA: A Multimodal LLM Agent for Fine-Grained Food Analysis
FoodCHA is a multimodal agentic framework proposed for fine-grained food analysis. It reformulates food recognition as a hierarchical decision-making process, progressively anchoring predictions from high-level categories to subcategories and then to cooking styles. The system addresses challenges in real-world food images such as high intra-class similarity and multiple food items per image. It targets two shortcomings of prior approaches: conventional deep learning models struggle with fine-grained attributes such as cooking style, while vision-language models often produce non-canonical labels that do not match a fixed food taxonomy. The framework aims to enable accurate dietary monitoring via camera-equipped mobile devices and wearables.
Key facts
- FoodCHA is a multimodal agentic framework for food recognition.
- It reformulates food recognition as a hierarchical decision-making process.
- It guides subcategory identification using high-level categories.
- It guides cooking style recognition using subcategories.
- It addresses high intra-class similarity and multiple food items per image.
- Deep learning models struggle with fine-grained attributes like cooking style.
- Vision-language models can produce non-canonical labels.
- The framework targets dietary monitoring via mobile devices and wearables.
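The hierarchical decision process above can be sketched as a coarse-to-fine pipeline in which each prediction constrains the candidate set for the next. The toy taxonomy, the `classify` stub, and all names below are illustrative assumptions, not the paper's actual models or label sets; the point is that restricting each step to canonical candidate labels avoids the non-canonical outputs free-form vision-language prompting can produce.

```python
# Hypothetical sketch of FoodCHA-style hierarchical food recognition.
# The taxonomy and the stub classifier are placeholders, not the real system.

TAXONOMY = {
    "noodle dish": {
        "ramen": ["boiled", "stir-fried"],
        "udon": ["boiled", "deep-fried"],
    },
    "rice dish": {
        "fried rice": ["stir-fried"],
        "risotto": ["simmered"],
    },
}

def classify(image, candidates):
    """Stand-in for a vision-language model restricted to canonical labels.

    Constraining the output to `candidates` is what keeps each step's
    prediction inside the taxonomy.
    """
    # Placeholder decision: a real model would score each candidate
    # against the image; here we simply return the first option.
    return candidates[0]

def analyze(image):
    # Step 1: anchor the prediction on a high-level category.
    category = classify(image, list(TAXONOMY))
    # Step 2: the category narrows the subcategory search space.
    subcategory = classify(image, list(TAXONOMY[category]))
    # Step 3: the subcategory narrows the cooking-style search space.
    style = classify(image, TAXONOMY[category][subcategory])
    return category, subcategory, style

print(analyze("meal_photo.jpg"))  # -> ('noodle dish', 'ramen', 'boiled')
```

Because each stage conditions on the previous one, an image with multiple food items can be handled by running the pipeline once per detected item, with each run confined to a progressively smaller label set.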