Manga109-v2026 Dataset Corrects 29,000 Annotation Errors for AI Research

digital · 2026-05-22

Researchers have introduced Manga109-v2026, an updated version of Manga109, a dataset widely used in AI research on manga understanding. The original dataset, frequently used for OCR and multimodal projects, contained transcription errors, missing text regions, overlaps between dialogue and onomatopoeia (sound effects), and under-segmented speech balloons. The team identified five categories of annotation issues and combined OCR-based detection with manual revision to correct approximately 29,000 dialogue annotations. The revised dataset is designed to align better with modern OCR systems and multimodal manga understanding tasks. Because manga is an influential part of Japanese popular culture, the revision is intended to keep the dataset relevant for AI research in translation, text recognition, and content analysis.

Key facts

Manga109-v2026 revises approximately 29,000 dialogue annotations.
Five categories of annotation issues were identified; those named include transcription errors, missing text regions, overlapping dialogue and onomatopoeia, and under-segmented speech balloons.
The original Manga109 dataset is foundational for manga-related AI research.
Corrections combine OCR-based detection and manual revision.
Manga is a culturally distinctive multimodal medium and an influential part of Japanese popular culture.
The revision aims to align with modern OCR and multimodal manga understanding tasks.
The dataset is used for AI systems targeting manga understanding, OCR, and translation.
The work is published on arXiv with ID 2605.21182.
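
The summary does not say how the OCR-based detection step works. As a rough illustration only, one plausible approach is to compare each existing transcription against fresh OCR output and send disagreements to a human reviewer. In the sketch below, the function name `flag_for_review`, the dictionary inputs, and the 0.8 similarity threshold are all hypothetical and are not taken from the paper.

```python
from difflib import SequenceMatcher


def flag_for_review(annotations, ocr_results, threshold=0.8):
    """Return IDs of annotations that disagree with OCR output.

    annotations: dict mapping annotation ID -> existing transcription.
    ocr_results: dict mapping annotation ID -> text read by an OCR system.
    threshold:   hypothetical similarity cutoff; pairs below it are flagged.
    """
    flagged = []
    for ann_id, annotated_text in annotations.items():
        ocr_text = ocr_results.get(ann_id)
        if ocr_text is None:
            # No OCR reading for this region: leave the decision to a human.
            flagged.append(ann_id)
            continue
        similarity = SequenceMatcher(None, annotated_text, ocr_text).ratio()
        if similarity < threshold:
            flagged.append(ann_id)
    return flagged


# Toy data, invented for illustration.
annotations = {"a": "こんにちは", "b": "ありがとう", "c": "おはよう"}
ocr_results = {"a": "こんにちは", "b": "さようなら"}
needs_review = flag_for_review(annotations, ocr_results)
```

In a workflow like this, the automated pass only narrows the set of candidates, and the manual revision the article describes would still decide which text is correct.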