ARTFEED — Contemporary Art Intelligence

MM-Eval Framework for Multimodal Summarization Evaluation

other · 2026-05-13

Researchers have introduced MM-Eval, a unified evaluation framework for Multimodal Summarization with Multimodal Output (MSMO). Existing techniques typically assess text quality, image-text alignment, and visual diversity in isolation with unimodal metrics, which fail to capture how these dimensions interact. MM-Eval comprises three components: text quality, assessed with OpenFActScore for factual accuracy and G-Eval for coherence, fluency, and relevance; image-text relevance, judged via an MLLM-as-a-judge approach; and image-set diversity, measured with Truncated CLIP Entropy. The framework aims to unify what is currently a fragmented evaluation process for MSMO.

Key facts

  • MM-Eval is a unified evaluation framework for MSMO.
  • Current MSMO evaluation is fragmented using unimodal metrics.
  • MM-Eval integrates text quality, cross-modal alignment, and visual diversity.
  • Text quality uses OpenFActScore and G-Eval.
  • Image-text relevance uses MLLM-as-a-judge.
  • Image-set diversity uses Truncated CLIP Entropy.
  • The framework addresses fragmentation in multimodal evaluation.
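The summary lists three score families but does not say how MM-Eval combines them. Purely as a sketch of what a unified report might look like, the snippet below assumes each component is normalized to [0, 1] and aggregated by a weighted average; the field names and weights are illustrative, not from the paper.

```python
from dataclasses import dataclass

@dataclass
class MMEvalScores:
    # Component scores, each assumed normalized to [0, 1] (illustrative).
    factuality: float            # e.g. from an OpenFActScore-style checker
    coherence: float             # e.g. from a G-Eval-style LLM rubric
    image_text_relevance: float  # e.g. from an MLLM-as-a-judge
    visual_diversity: float      # e.g. Truncated CLIP Entropy, rescaled

    def unified(self, weights=(0.25, 0.25, 0.25, 0.25)) -> float:
        """Weighted average of the components (hypothetical aggregation)."""
        parts = (self.factuality, self.coherence,
                 self.image_text_relevance, self.visual_diversity)
        return sum(w * p for w, p in zip(weights, parts))
```

For example, equal weights over scores (0.8, 0.6, 0.4, 0.2) yield 0.5; any real deployment would need the paper's actual normalization and weighting scheme.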

Entities

Institutions

  • arXiv

Sources