MultiSoc-4D Benchmark Reveals Label Collapse in Bengali LLM Annotation
A new benchmark dataset, MultiSoc-4D, exposes a systematic flaw in how large language models handle closed-set annotation tasks for low-resource languages. The dataset comprises over 58,000 Bengali social media comments drawn from six sources, annotated across four dimensions: category, sentiment, hate speech, and sarcasm. Researchers employed ChatGPT, Gemini, Claude, and Grok to annotate separate partitions, with a shared 20% validation set. They discovered 'instruction-induced label collapse,' a failure mode in which LLMs disproportionately select fallback labels such as Other, Neutral, or No, producing high inter-model agreement alongside severe under-detection of minority categories. For instance, the models failed to detect 79% of hate-speech instances and 75% of sarcastic instances. This phenomenon undermines the reliability of automated annotation for low-resource languages and highlights the need for better instruction design. The study is published on arXiv under identifier 2605.06940.
Key facts
- MultiSoc-4D is a Bengali social media benchmark with 58K+ comments.
- Comments come from six sources and are annotated for category, sentiment, hate speech, and sarcasm.
- LLMs used: ChatGPT, Gemini, Claude, and Grok.
- 20% of data served as a common validation set.
- Instruction-induced label collapse causes preference for fallback labels (Other, Neutral, No).
- LLMs failed to detect 79% of hate speech instances and 75% of sarcasm instances.
- The study is published on arXiv (2605.06940).
- The phenomenon reduces annotation reliability for low-resource languages.
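The dynamic behind label collapse can be sketched with a toy example. The labels below are made up for illustration (they are not the paper's data, and the helper functions are hypothetical, not the study's evaluation code); the point is that two annotators defaulting to the fallback label can agree with each other almost perfectly while detecting few or none of the minority-class items.

```python
def agreement(a, b):
    """Fraction of items on which two annotators assign the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def minority_recall(pred, gold, positive="Yes"):
    """Fraction of gold-positive items the annotator actually flags."""
    hits = sum(p == positive for p, g in zip(pred, gold) if g == positive)
    total = sum(g == positive for g in gold)
    return hits / total

# Illustrative toy data: a binary hate-speech task with a 20% minority
# class, and two hypothetical LLM annotators biased toward "No".
gold    = ["Yes", "No", "No", "No", "No", "Yes", "No", "No", "No", "No"]
model_a = ["No",  "No", "No", "No", "No", "Yes", "No", "No", "No", "No"]
model_b = ["No",  "No", "No", "No", "No", "No",  "No", "No", "No", "No"]

print(agreement(model_a, model_b))     # 0.9 -- high inter-model agreement
print(minority_recall(model_a, gold))  # 0.5 -- half the positives missed
print(minority_recall(model_b, gold))  # 0.0 -- total collapse to "No"
```

This is why raw agreement is a misleading quality signal under fallback bias: it rewards annotators for converging on the majority label rather than for finding the rare categories the task actually cares about.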
Entities
Institutions
- arXiv