ARTFEED — Contemporary Art Intelligence

MULTITEXTEDIT: A Cross-Lingual Benchmark for Text-in-Image Editing

ai-technology · 2026-05-12

Researchers have released MULTITEXTEDIT, a benchmark for evaluating text-in-image editing across languages. It comprises 3,600 instances spanning 12 typologically diverse languages, 5 visual domains, and 7 editing operations. Each instance is built on a shared visual base and pairs a human-edited reference with region masks that isolate the language-specific content. To capture script-level errors, such as missing diacritics or broken right-to-left ordering, the authors introduce a language fidelity (LSF) metric based on a two-stage LVM protocol, which reaches a quadratic-weighted kappa of 0.76 against native-speaker judgments. The work highlights the English-centric bias of existing benchmarks and aims to improve cross-lingual semantic fidelity in visual content generation.
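Quadratic-weighted kappa, the agreement statistic reported for the LSF metric, penalizes disagreements between two ordinal raters in proportion to the squared distance between their scores. The sketch below is a minimal pure-Python illustration of how such agreement between automated scores and native-speaker ratings can be computed; the rating data and the 5-point scale are hypothetical, not taken from the benchmark.

```python
from collections import Counter

def quadratic_weighted_kappa(rater_a, rater_b, num_levels):
    """Quadratic-weighted Cohen's kappa between two equal-length lists of
    ordinal ratings (integers in [0, num_levels))."""
    n = len(rater_a)
    # Observed co-occurrence matrix of rating pairs.
    observed = [[0.0] * num_levels for _ in range(num_levels)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1
    # Expected matrix under independence, from the marginal histograms.
    hist_a, hist_b = Counter(rater_a), Counter(rater_b)
    expected = [[hist_a[i] * hist_b[j] / n for j in range(num_levels)]
                for i in range(num_levels)]
    # Quadratic disagreement weights: w_ij = (i - j)^2 / (N - 1)^2.
    num = den = 0.0
    for i in range(num_levels):
        for j in range(num_levels):
            w = (i - j) ** 2 / (num_levels - 1) ** 2
            num += w * observed[i][j]
            den += w * expected[i][j]
    return 1.0 - num / den

# Illustrative only: automated scores vs. human ratings on a 0-4 scale.
auto_scores  = [4, 3, 3, 2, 4, 1, 0, 3, 2, 4]
human_scores = [4, 3, 2, 2, 4, 1, 1, 3, 2, 3]
kappa = quadratic_weighted_kappa(auto_scores, human_scores, 5)
```

Perfect agreement yields a kappa of 1.0, while chance-level agreement yields 0; the reported 0.76 indicates substantial alignment between the automated metric and native speakers.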

Key facts

  • MULTITEXTEDIT is a benchmark for cross-lingual text-in-image editing.
  • It includes 3,600 instances across 12 languages.
  • Covers 5 visual domains and 7 editing operations.
  • Each instance has a human-edited reference and region masks.
  • Introduces a language fidelity (LSF) metric.
  • LSF uses a two-stage LVM protocol.
  • Achieves quadratic-weighted kappa of 0.76 against native speakers.
  • Addresses English-centric bias in existing benchmarks.
