MULTITEXTEDIT: A Cross-Lingual Benchmark for Text-in-Image Editing
Researchers have released MULTITEXTEDIT, a benchmark for evaluating text-in-image editing across languages. It comprises 3,600 instances spanning 12 typologically diverse languages, 5 visual domains, and 7 editing operations. Each instance is built on a shared visual foundation and pairs a human-edited reference with region masks that isolate the language-dependent content. To catch script-level errors, such as missing diacritics or broken right-to-left (RTL) ordering, the authors introduce a language fidelity (LSF) metric based on a two-stage LVM judging protocol, which reaches a quadratic-weighted kappa of 0.76 against native-speaker annotations. The work highlights the English-centric bias of existing benchmarks and aims to improve cross-lingual semantic fidelity in visual content generation.
Key facts
- MULTITEXTEDIT is a benchmark for cross-lingual text-in-image editing.
- It includes 3,600 instances across 12 languages.
- Covers 5 visual domains and 7 editing operations.
- Each instance has a human-edited reference and region masks.
- Introduces a language fidelity (LSF) metric.
- LSF uses a two-stage LVM protocol.
- The LSF metric achieves a quadratic-weighted kappa of 0.76 against native-speaker judgments.
- Addresses English-centric bias in existing benchmarks.
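The agreement figure above is a quadratic-weighted Cohen's kappa, which penalizes rater disagreements by the squared distance between ordinal scores. A minimal pure-Python sketch of that statistic follows; the 0–3 legibility scale and the sample ratings are illustrative assumptions, not data from the benchmark:

```python
from collections import Counter

def quadratic_weighted_kappa(rater_a, rater_b, num_levels):
    """Quadratic-weighted Cohen's kappa for ordinal ratings in [0, num_levels)."""
    n = len(rater_a)
    # Observed confusion matrix between the two raters
    observed = [[0.0] * num_levels for _ in range(num_levels)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1
    # Marginal rating histograms
    hist_a, hist_b = Counter(rater_a), Counter(rater_b)
    # Expected matrix under rater independence (outer product of marginals)
    expected = [[hist_a[i] * hist_b[j] / n for j in range(num_levels)]
                for i in range(num_levels)]
    # Quadratic disagreement weights, normalized to [0, 1]
    w = [[(i - j) ** 2 / (num_levels - 1) ** 2 for j in range(num_levels)]
         for i in range(num_levels)]
    num = sum(w[i][j] * observed[i][j]
              for i in range(num_levels) for j in range(num_levels))
    den = sum(w[i][j] * expected[i][j]
              for i in range(num_levels) for j in range(num_levels))
    return 1.0 - num / den

# Hypothetical example: two raters scoring six edits on a 0-3 legibility scale
a = [0, 1, 2, 3, 3, 2]
b = [0, 1, 1, 3, 2, 2]
print(round(quadratic_weighted_kappa(a, b, 4), 3))  # -> 0.846
```

Identical ratings yield a kappa of exactly 1.0; values near 0.76, as reported for LSF, indicate substantial agreement with native speakers.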