MultiSoc-4D Benchmark Reveals Label Collapse in Bengali LLM Annotation
A new benchmark dataset, MultiSoc-4D, exposes a systematic flaw in how large language models handle closed-set annotation tasks for low-resource languages. The dataset comprises over 58,000 Bengali social media comments drawn from six sources, annotated across four dimensions: category, sentiment, hate speech, and sarcasm. Researchers employed ChatGPT, Gemini, Claude, and Grok to annotate separate partitions, with a shared 20% validation set. They discovered 'instruction-induced label collapse,' a failure mode in which LLMs disproportionately select fallback labels such as Other, Neutral, or No, producing high inter-model agreement alongside severe under-detection of minority categories. For instance, the models failed to detect 79% of hate-speech instances and 75% of sarcastic instances. This phenomenon undermines the reliability of automated annotation for low-resource languages and highlights the need for better instruction design. The study is published on arXiv under identifier 2605.06940.
Key facts
- MultiSoc-4D is a Bengali social media benchmark with 58K+ comments.
- Comments come from six sources and are annotated for category, sentiment, hate speech, and sarcasm.
- LLMs used: ChatGPT, Gemini, Claude, and Grok.
- 20% of data served as a common validation set.
- Instruction-induced label collapse causes preference for fallback labels (Other, Neutral, No).
- LLMs failed to detect 79% of hate speech instances and 75% of sarcasm instances.
- The study is published on arXiv (2605.06940).
- The phenomenon reduces annotation reliability for low-resource languages.
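The dynamic behind label collapse can be sketched with a toy example. The labels below are made up for illustration (they are not the paper's data, and the helper functions are hypothetical, not the study's evaluation code); the point is that two annotators defaulting to the fallback label can agree with each other almost perfectly while detecting few or none of the minority-class items.

```python
def agreement(a, b):
    """Fraction of items on which two annotators assign the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def minority_recall(pred, gold, positive="Yes"):
    """Fraction of gold-positive items the annotator actually flags."""
    hits = sum(p == positive for p, g in zip(pred, gold) if g == positive)
    total = sum(g == positive for g in gold)
    return hits / total

# Illustrative toy data: a binary hate-speech task with a 20% minority
# class, and two hypothetical LLM annotators biased toward "No".
gold    = ["Yes", "No", "No", "No", "No", "Yes", "No", "No", "No", "No"]
model_a = ["No",  "No", "No", "No", "No", "Yes", "No", "No", "No", "No"]
model_b = ["No",  "No", "No", "No", "No", "No",  "No", "No", "No", "No"]

print(agreement(model_a, model_b))     # 0.9 -- high inter-model agreement
print(minority_recall(model_a, gold))  # 0.5 -- half the positives missed
print(minority_recall(model_b, gold))  # 0.0 -- total collapse to "No"
```

This is why raw agreement is a misleading quality signal under fallback bias: it rewards annotators for converging on the majority label rather than for finding the rare categories the task actually cares about.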
Entities
Institutions
- arXiv