SiNFluD: A Benchmark Dataset for Sindhi Figurative Language Classification

other · 2026-05-06

Researchers have introduced SiNFluD, a benchmark dataset for classifying figurative language in Sindhi. Raw text was collected from blogs, social media platforms, and literary sources, then labeled by two native speakers using the Doccano annotation tool, with an inter-annotator agreement of 0.81. Baselines were established under 5-fold and 10-fold cross-validation with mBERT, XLM-RoBERTa, and XLM-RoBERTa-XL, plus SetFit for few-shot fine-tuning of sentence transformers. XLM-RoBERTa-XL performed best among the evaluated models.
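This summary reports the 0.81 agreement score without naming the statistic. Assuming it is Cohen's kappa, a common choice for two annotators, the sketch below shows how such a score could be computed. The label names and the sample annotations are hypothetical and are not taken from the dataset.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: probability that both pick the same label
    # if each annotator labeled at random with their own label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    all_labels = count_a.keys() | count_b.keys()
    expected = sum(count_a[l] * count_b[l] for l in all_labels) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label everywhere.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical annotations; the real SiNFluD label set is not given here.
annotator_1 = ["figurative", "literal", "figurative", "literal", "literal", "figurative"]
annotator_2 = ["figurative", "literal", "literal", "literal", "literal", "figurative"]
print(round(cohen_kappa(annotator_1, annotator_2), 3))
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it is usually preferred over a plain percentage when label frequencies are uneven.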

Key facts

SiNFluD is a benchmark dataset for Sindhi figurative language classification.
Raw text was collected from blogs, social media platforms, and literary sources.
Two native annotators labeled the data using Doccano.
Inter-annotator agreement reached 0.81.
Baseline results were obtained with 5-fold and 10-fold cross-validation.
Models evaluated include mBERT, XLM-RoBERTa, XLM-RoBERTa-XL, and SetFit.
XLM-RoBERTa-XL achieved the best performance.
SetFit was used for few-shot fine-tuning of sentence transformers.
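The cross-validation setup listed above can be sketched as follows. This is a minimal illustration that assumes stratified splitting, which the summary does not confirm. The function name `stratified_k_fold` and the sample labels are hypothetical.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k, seed=0):
    """Yield (train_indices, test_indices) for k folds.

    Each label's examples are spread across folds round-robin so that
    every fold keeps roughly the same label proportions.
    """
    if k < 2 or k > len(labels):
        raise ValueError("k must be between 2 and the number of examples")

    # Group example indices by label.
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    offset = 0
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # Continue the round-robin from where the previous label stopped,
        # so leftover examples do not pile up in the first folds.
        for pos, idx in enumerate(indices):
            folds[(offset + pos) % k].append(idx)
        offset += len(indices)

    for i in range(k):
        test_idx = sorted(folds[i])
        train_idx = sorted(
            idx for j, fold in enumerate(folds) if j != i for idx in fold
        )
        yield train_idx, test_idx


# Hypothetical labels standing in for the annotated corpus.
labels = ["figurative"] * 10 + ["literal"] * 10
for k in (5, 10):
    sizes = [len(test) for _, test in stratified_k_fold(labels, k)]
    print(k, sizes)
```

In practice each training split would be used to fine-tune a model such as XLM-RoBERTa, the held-out split to score it, and the scores averaged over the folds.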

Sources

arXiv cs.AI — 2026-05-05