ARTFEED — Contemporary Art Intelligence

FoodCHA: A Multimodal LLM Agent for Fine-Grained Food Analysis

ai-technology · 2026-05-09

FoodCHA is a multimodal agentic framework proposed for fine-grained food analysis. It reformulates food recognition as a hierarchical decision-making process, progressively anchoring predictions from high-level categories to subcategories and then to cooking styles. The system addresses challenges common in real-world food images, such as high intra-class similarity and multiple food items per image. It targets two failure modes of prior approaches: deep learning classifiers that struggle with fine-grained attributes such as cooking style, and vision-language models that produce non-canonical labels. The framework aims to enable accurate dietary monitoring via camera-equipped mobile devices and wearables.

Key facts

  • FoodCHA is a multimodal agentic framework for food recognition.
  • It reformulates food recognition as a hierarchical decision-making process.
  • It guides subcategory identification using high-level categories.
  • It guides cooking style recognition using subcategories.
  • It addresses high intra-class similarity and multiple food items per image.
  • Deep learning models struggle with fine-grained attributes like cooking style.
  • Vision-language models can produce non-canonical labels.
  • The framework targets dietary monitoring via mobile devices and wearables.
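The hierarchical anchoring described above can be sketched in miniature. The taxonomy, feature set, and `score` function below are illustrative stand-ins (not the paper's actual model or labels): each finer-grained prediction is restricted to children of the previous choice, which both narrows the search and guarantees a canonical, in-taxonomy label.

```python
# Illustrative sketch of hierarchical decision-making for food recognition.
# TAXONOMY and score() are hypothetical stand-ins for the framework's
# actual label hierarchy and multimodal scoring.

TAXONOMY = {
    "noodles": {
        "ramen": ["tonkotsu", "shoyu"],
        "udon": ["kake", "tempura"],
    },
    "rice": {
        "fried rice": ["egg", "kimchi"],
        "curry rice": ["katsu", "vegetable"],
    },
}

def score(label: str, image_features: set) -> int:
    # Stand-in for a learned vision-language score: count extracted
    # feature keywords that match the candidate label.
    return sum(1 for f in image_features if f in label)

def hierarchical_predict(image_features: set) -> tuple:
    # Step 1: anchor the high-level category over the full taxonomy.
    category = max(TAXONOMY, key=lambda c: score(c, image_features))
    # Step 2: subcategory choice is constrained to the chosen category.
    subcategory = max(TAXONOMY[category],
                      key=lambda s: score(s, image_features))
    # Step 3: cooking style is constrained to the chosen subcategory,
    # so the output is always a canonical in-taxonomy label.
    style = max(TAXONOMY[category][subcategory],
                key=lambda st: score(st, image_features))
    return category, subcategory, style

print(hierarchical_predict({"rice", "fried rice", "kimchi"}))
# → ('rice', 'fried rice', 'kimchi')
```

Constraining each step to the previous level's children is what lets the approach sidestep the non-canonical labels that free-form vision-language generation can produce.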

Entities

Institutions

  • arXiv

Sources