Multilingual Framework Detects Reclaimed Slurs in LGBTQ+ Discourse
A new multi-stage framework identifies reclaimed slurs in multilingual social media. It distinguishes reclamatory from non-reclamatory uses of LGBTQ+-related slurs in tweets written in English, Spanish, and Italian, and it targets three obstacles: scarce data, class imbalance, and variation in sentiment across languages. The framework combines cross-validation for model selection, back-translation for semantic-preserving augmentation, inductive transfer learning with dynamic epoch-level undersampling, and masked language modeling to inject domain-specific knowledge. Eight multilingual embedding models were evaluated, and XLM-RoBERTa was chosen as the foundation model based on its macro-averaged F1 score. Back-translation with GPT-4o-mini tripled the training corpus.
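To make "dynamic epoch-level undersampling" concrete, here is a minimal sketch of one plausible reading: at the start of every epoch, each class is cut down to the size of the rarest class, with a fresh random draw per epoch so the larger class contributes different examples over time. The function name `epoch_undersample`, the 1:1 balance target, and the toy labels are assumptions for illustration; the summary does not state the sampling ratio or schedule the authors used.

```python
import random
from collections import defaultdict


def epoch_undersample(labels, epoch, seed=0):
    """Return shuffled training indices for one epoch with balanced classes.

    Assumption: every class is reduced to the size of the rarest class.
    The random draw depends on the epoch number, so larger classes
    contribute a different subset in each epoch.
    """
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    target = min(len(idxs) for idxs in by_class.values())
    # Seed per epoch so a run is reproducible but epochs differ.
    rng = random.Random(seed * 100_003 + epoch)

    chosen = []
    for label in sorted(by_class):
        chosen.extend(rng.sample(by_class[label], target))
    rng.shuffle(chosen)
    return chosen


# Toy labels (hypothetical): 1 = reclamatory (rare), 0 = non-reclamatory (common).
labels = [1] * 5 + [0] * 45
epoch_indices = [epoch_undersample(labels, epoch) for epoch in range(3)]
```

In a training loop, the returned indices would select the examples fed to the model for that epoch. Every rare-class example appears in every epoch, while the common class is resampled each time.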
Key facts
- Framework detects reclaimed slurs in multilingual social media
- Focuses on LGBTQ+-related slurs in English, Spanish, and Italian
- Addresses data scarcity, class imbalance, and cross-linguistic variation in sentiment
- Uses cross-validation, back-translation, transfer learning, and masked language modeling
- XLM-RoBERTa selected as foundation model
- GPT-4o-mini back-translation tripled training corpus
- Evaluated eight multilingual embedding models
- Published on arXiv under ID 2605.13415
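The model-selection criterion named above, macro-averaged F1, gives every class equal weight regardless of its frequency, which matters when reclamatory uses are rare. The sketch below computes it from scratch and picks the candidate with the highest mean score across validation folds. The names `model_a` and `model_b` and the toy predictions are hypothetical and are not results from the paper.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def select_model(fold_predictions):
    """Pick the model with the highest mean macro-F1 across folds.

    fold_predictions maps a model name to a list of (y_true, y_pred)
    pairs, one pair per cross-validation fold.
    """
    mean_scores = {
        name: sum(macro_f1(t, p) for t, p in folds) / len(folds)
        for name, folds in fold_predictions.items()
    }
    best = max(mean_scores, key=mean_scores.get)
    return best, mean_scores


# Toy single-fold example (hypothetical data): one rare positive in eight tweets.
y_true = [1, 0, 0, 0, 0, 0, 0, 0]
fold_predictions = {
    "model_a": [(y_true, [0, 0, 0, 0, 0, 0, 0, 0])],  # always predicts the majority class
    "model_b": [(y_true, [1, 1, 0, 0, 0, 0, 0, 0])],  # finds the positive, one false alarm
}
best, mean_scores = select_model(fold_predictions)
```

Both toy models get seven of eight tweets right, so accuracy cannot separate them. Macro-F1 prefers `model_b` because `model_a` scores zero on the rare class.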
Entities
Institutions
- arXiv