SMARTER: Data-Efficient Framework for Explainable Toxicity Detection Using LLMs
Researchers have unveiled SMARTER, a two-stage, data-efficient framework for explainable content moderation built on Large Language Models (LLMs). In the first stage, LLM outputs are used to generate synthetic explanations for both correct and incorrect labels, enabling alignment through preference optimization with minimal human supervision. In the second stage, explanation quality is refined via cross-model training, in which weaker models learn to align stylistically and semantically with stronger ones. On the HateXplain, Latent Hate, and Implicit Hate benchmarks, SMARTER achieves a macro-F1 improvement of up to 13% over standard few-shot baselines while using only a small fraction of the full training data, making it a scalable approach for low-resource settings.
Key facts
- SMARTER is a two-stage framework for explainable content moderation.
- Stage 1 uses LLM outputs to generate synthetic explanations for alignment.
- Stage 2 uses cross-model training to refine explanation quality.
- Experiments conducted on HateXplain, Latent Hate, and Implicit Hate benchmarks.
- Achieves up to 13% macro-F1 improvement over standard few-shot baselines.
- Uses only a fraction of the full training data.
- Aims to address proliferation of toxic content on social media.
- Framework is data-efficient and scalable for low-resource settings.
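Stage 1 as described above can be sketched as building preference pairs: for each post, the explanation supporting the correct label is marked "chosen" and the explanation supporting an incorrect label is marked "rejected", yielding training data in the format expected by preference-optimization methods such as DPO. The following is an illustrative sketch, not the authors' code; the field names (`gold_label`, `wrong_label`, etc.) are hypothetical.

```python
# Sketch of Stage-1 data construction (assumed structure, not SMARTER's
# actual implementation): turn LLM-generated explanations for correct and
# incorrect labels into (prompt, chosen, rejected) preference triples.

def build_preference_pairs(examples):
    """Each example carries one explanation for the gold label and one for
    a wrong label; pair them as chosen vs. rejected for preference tuning."""
    pairs = []
    for ex in examples:
        prompt = f"Classify the post and explain your decision: {ex['text']}"
        pairs.append({
            "prompt": prompt,
            # Explanation consistent with the gold label -> preferred response
            "chosen": f"Label: {ex['gold_label']}. {ex['correct_explanation']}",
            # Explanation arguing for an incorrect label -> dispreferred response
            "rejected": f"Label: {ex['wrong_label']}. {ex['incorrect_explanation']}",
        })
    return pairs

# Tiny usage example with a synthetic record
examples = [{
    "text": "example post",
    "gold_label": "toxic",
    "wrong_label": "non-toxic",
    "correct_explanation": "The post contains a slur targeting a group.",
    "incorrect_explanation": "The post is a neutral statement.",
}]
pairs = build_preference_pairs(examples)
```

Triples in this shape can then be fed to an off-the-shelf preference-optimization trainer, which is what allows Stage 1 to proceed with limited human annotation.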
Entities
Institutions
- arXiv