FoodCHA: A Multimodal LLM Agent for Fine-Grained Food Analysis
FoodCHA is a multimodal agentic framework proposed for fine-grained food analysis. It reformulates food recognition as a hierarchical decision-making process, progressively anchoring predictions from high-level categories to subcategories and then to cooking styles. The system addresses challenges in real-world food images such as high intra-class similarity and multiple food items per image. It targets two shortcomings of prior approaches: conventional deep learning models struggle with fine-grained attributes such as cooking style, while vision-language models often produce non-canonical labels that do not match a fixed food taxonomy. The framework aims to enable accurate dietary monitoring via camera-equipped mobile devices and wearables.
Key facts
- FoodCHA is a multimodal agentic framework for food recognition.
- It reformulates food recognition as a hierarchical decision-making process.
- It guides subcategory identification using high-level categories.
- It guides cooking style recognition using subcategories.
- It addresses high intra-class similarity and multiple food items per image.
- Deep learning models struggle with fine-grained attributes like cooking style.
- Vision-language models can produce non-canonical labels.
- The framework targets dietary monitoring via mobile devices and wearables.
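The hierarchical decision process above can be sketched as a coarse-to-fine pipeline in which each prediction constrains the candidate set for the next. The toy taxonomy, the `classify` stub, and all names below are illustrative assumptions, not the paper's actual models or label sets; the point is that restricting each step to canonical candidate labels avoids the non-canonical outputs free-form vision-language prompting can produce.

```python
# Hypothetical sketch of FoodCHA-style hierarchical food recognition.
# The taxonomy and the stub classifier are placeholders, not the real system.

TAXONOMY = {
    "noodle dish": {
        "ramen": ["boiled", "stir-fried"],
        "udon": ["boiled", "deep-fried"],
    },
    "rice dish": {
        "fried rice": ["stir-fried"],
        "risotto": ["simmered"],
    },
}

def classify(image, candidates):
    """Stand-in for a vision-language model restricted to canonical labels.

    Constraining the output to `candidates` is what keeps each step's
    prediction inside the taxonomy.
    """
    # Placeholder decision: a real model would score each candidate
    # against the image; here we simply return the first option.
    return candidates[0]

def analyze(image):
    # Step 1: anchor the prediction on a high-level category.
    category = classify(image, list(TAXONOMY))
    # Step 2: the category narrows the subcategory search space.
    subcategory = classify(image, list(TAXONOMY[category]))
    # Step 3: the subcategory narrows the cooking-style search space.
    style = classify(image, TAXONOMY[category][subcategory])
    return category, subcategory, style

print(analyze("meal_photo.jpg"))  # -> ('noodle dish', 'ramen', 'boiled')
```

Because each stage conditions on the previous one, an image with multiple food items can be handled by running the pipeline once per detected item, with each run confined to a progressively smaller label set.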