MM-Eval Framework for Multimodal Summarization Evaluation
Researchers have proposed MM-Eval, a unified evaluation framework for Multimodal Summarization with Multimodal Output (MSMO). Existing evaluation practice is fragmented: text quality, image-text alignment, and visual diversity are each scored in isolation with unimodal metrics, which fails to capture how the modalities interact. MM-Eval brings these dimensions together in three components: text quality, assessed with OpenFActScore for factual accuracy and G-Eval for coherence, fluency, and relevance; image-text relevance, judged with an MLLM-as-a-judge protocol; and image-set diversity, measured with Truncated CLIP Entropy.
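The judge component can be illustrated with a small sketch. Everything below is an assumption for illustration, not the paper's protocol: the prompt wording, the 1-5 rating scale, and the use of text captions standing in for actual image inputs; the model call is injected as a callable so no particular MLLM API is assumed.

```python
from typing import Callable

def judge_image_text_relevance(
    summary_text: str,
    image_captions: list[str],
    mllm: Callable[[str], str],
) -> float:
    """Hypothetical MLLM-as-a-judge loop: rate each image's relevance to
    the summary on a 1-5 scale, then average and rescale to [0, 1].
    Captions stand in for real image inputs in this sketch."""
    scores = []
    for caption in image_captions:
        prompt = (
            "On a scale of 1 to 5, how relevant is this image to the "
            f"summary?\nSummary: {summary_text}\nImage: {caption}\n"
            "Answer with a single digit."
        )
        reply = mllm(prompt)
        digits = [c for c in reply if c in "12345"]
        scores.append(int(digits[0]) if digits else 1)  # default to lowest score
    mean = sum(scores) / len(scores)
    return (mean - 1) / 4  # rescale 1-5 -> 0-1

# Stub standing in for a real multimodal LLM call.
stub = lambda prompt: "4"
score = judge_image_text_relevance(
    "A storm hit the coast.",
    ["waves crashing", "flooded street"],
    stub,
)
print(score)  # 0.75
```

Injecting the model as a callable keeps the scoring logic testable and separates the judging protocol from any specific model backend.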
Key facts
- MM-Eval is a unified evaluation framework for MSMO.
- Current MSMO evaluation is fragmented using unimodal metrics.
- MM-Eval integrates text quality, cross-modal alignment, and visual diversity.
- Text quality uses OpenFActScore and G-Eval.
- Image-text relevance uses MLLM-as-a-judge.
- Image-set diversity uses Truncated CLIP Entropy.
- The framework addresses fragmentation in multimodal evaluation.
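The diversity metric in the list above can be sketched with a toy computation. The sketch assumes Truncated CLIP Entropy is the Shannon entropy of the top-k normalized eigenvalues of the image-embedding covariance matrix; this reading of "truncated", the choice of k, and the random vectors standing in for real CLIP embeddings are all assumptions, not the published definition.

```python
import numpy as np

def truncated_clip_entropy(embeddings: np.ndarray, k: int = 5) -> float:
    """Entropy of the top-k normalized eigenvalues of the embedding
    covariance -- one plausible reading of a truncated spectral entropy;
    the exact published formulation may differ."""
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / max(len(embeddings) - 1, 1)
    eigvals = np.linalg.eigvalsh(cov)[::-1][:k]   # largest k eigenvalues
    eigvals = np.clip(eigvals, 1e-12, None)
    p = eigvals / eigvals.sum()                   # normalized spectrum
    return float(-(p * np.log(p)).sum())

rng = np.random.default_rng(0)
# Random vectors stand in for CLIP image embeddings (dim 512).
diverse = rng.normal(size=(8, 512))               # 8 unrelated images
direction = rng.normal(size=512)                  # one dominant visual theme
redundant = np.outer(rng.normal(size=8), direction) + 0.001 * rng.normal(size=(8, 512))
print(truncated_clip_entropy(diverse) > truncated_clip_entropy(redundant))  # True
```

A diverse image set spreads variance across many directions of embedding space (entropy near log k), while a redundant set concentrates it along one direction (entropy near zero), which is the behavior a diversity metric should reward.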