LLM Self-Explanations vs Human Rationales in Text Classification

other · 2026-05-22

A study compares self-explanations generated by instruction-tuned LLMs with human rationales for text classification tasks. The research evaluates plausibility and faithfulness across sentiment classification, forced labor detection, and claim verification. Human rationale annotations were collected for the Climate-Fever dataset. Danish and Italian translations of the sentiment task were included. The study also incorporates post-hoc attribution-based explanations to extend the analysis.

Key facts

Instruction-tuned LLMs can generate self-explanations without complex interpretability techniques.
The study evaluates self-explanations, given in the form of input rationales, for their plausibility to humans.
Three text classification tasks are studied: sentiment classification, forced labor detection, and claim verification.
Danish and Italian translations of the sentiment classification task are included.
Human rationale annotations were collected for the Climate-Fever claim verification dataset.
The faithfulness of human and self-explanation rationales is evaluated with respect to correct model predictions.
The study extends the analysis by incorporating post-hoc attribution-based explanations.
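
As a rough illustration of the two evaluation axes named above, the sketch below scores plausibility as token-level F1 between a model rationale and a human rationale, and faithfulness as the drop in predicted-label probability once the rationale tokens are removed (often called comprehensiveness). The summary does not say which metrics the study uses, so both choices are assumptions. The names rationale_f1, comprehensiveness, and toy_predict_proba are hypothetical, and the toy classifier only stands in for a real model.

```python
def rationale_f1(predicted, human):
    """Token-level F1 between two rationales given as token indices.

    Assumed plausibility proxy: agreement with the human annotation.
    """
    predicted, human = set(predicted), set(human)
    overlap = len(predicted & human)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(human)
    return 2 * precision * recall / (precision + recall)


def comprehensiveness(predict_proba, tokens, rationale, label):
    """Drop in the probability of `label` after deleting rationale tokens.

    Assumed faithfulness proxy: a larger drop suggests the model
    relied more heavily on the rationale.
    """
    removed = set(rationale)
    kept = [tok for i, tok in enumerate(tokens) if i not in removed]
    return predict_proba(tokens)[label] - predict_proba(kept)[label]


def toy_predict_proba(tokens):
    """Hypothetical stand-in classifier: more positive words, higher score."""
    positive_words = {"great", "loved"}
    hits = sum(tok in positive_words for tok in tokens)
    p_positive = (1 + hits) / (2 + hits)
    return {"positive": p_positive, "negative": 1 - p_positive}


tokens = ["i", "loved", "this", "great", "film"]
human_rationale = [1, 3]   # "loved", "great"
self_explanation = [1, 2]  # "loved", "this"

plausibility = rationale_f1(self_explanation, human_rationale)
faith_human = comprehensiveness(toy_predict_proba, tokens, human_rationale, "positive")
faith_self = comprehensiveness(toy_predict_proba, tokens, self_explanation, "positive")
print(plausibility, faith_human, faith_self)
```

In this toy setup the human rationale removes both sentiment-bearing words and so produces a larger probability drop than the self-explanation, which is the kind of comparison the faithfulness evaluation is concerned with.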
