ARTFEED — Contemporary Art Intelligence

Comparative Study Evaluates Explainability Techniques for Large Language Models

ai-technology · 2026-04-20

The study is a comparative analysis of three explainability methods for large language models, focused on practical evaluation rather than proposing new techniques. It examines Integrated Gradients, Attention Rollout, and SHAP on a DistilBERT model fine-tuned for SST-2 sentiment classification. According to the findings, gradient-based attribution produced the most stable and intuitive explanations; attention-based approaches were computationally efficient but less aligned with the features actually driving predictions; and model-agnostic techniques offered flexibility at the cost of higher computation and greater variability in results. The authors stress that transparency in LLM decision processes matters for building trust, for debugging, and for real-world deployment. The work is documented in arXiv preprint 2604.15371v1 (announced as a cross-listing), and the evaluation used a single, consistent, reproducible experimental setup throughout.
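
For orientation, the sketch below shows what the gradient-based setup the study describes might look like in practice: Integrated Gradients attributions over a DistilBERT model fine-tuned on SST-2, computed with Hugging Face Transformers and Captum. This is a minimal illustration, not the paper's code; the checkpoint name and the example sentence are assumptions.

    # Minimal sketch (not the paper's code): Integrated Gradients on a
    # DistilBERT SST-2 classifier. Checkpoint and sentence are assumed.
    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification
    from captum.attr import LayerIntegratedGradients

    model_name = "distilbert-base-uncased-finetuned-sst-2-english"  # assumed checkpoint
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    model.eval()

    def forward_func(input_ids, attention_mask):
        # Return the positive-class logit so attributions explain that score.
        return model(input_ids=input_ids, attention_mask=attention_mask).logits[:, 1]

    text = "The film is a quiet, devastating triumph."
    enc = tokenizer(text, return_tensors="pt")
    input_ids, attention_mask = enc["input_ids"], enc["attention_mask"]

    # Baseline: same length, all non-special tokens replaced by [PAD].
    baseline_ids = torch.full_like(input_ids, tokenizer.pad_token_id)
    baseline_ids[0, 0] = tokenizer.cls_token_id
    baseline_ids[0, -1] = tokenizer.sep_token_id

    # Attribute through the embedding layer, integrating over 50 steps.
    lig = LayerIntegratedGradients(forward_func, model.distilbert.embeddings)
    attributions, delta = lig.attribute(
        inputs=input_ids,
        baselines=baseline_ids,
        additional_forward_args=(attention_mask,),
        n_steps=50,
        return_convergence_delta=True,
    )

    # Sum over the embedding dimension to get one relevance score per token.
    token_scores = attributions.sum(dim=-1).squeeze(0)
    tokens = tokenizer.convert_ids_to_tokens(input_ids.squeeze(0))
    for tok, score in zip(tokens, token_scores.tolist()):
        print(f"{tok:>12s}  {score:+.4f}")

The per-token scores printed at the end are the kind of output whose stability and intuitiveness the study compares across methods; the convergence delta gives a rough check on how well the integral was approximated.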

Key facts

  • Study compares three explainability techniques for LLMs
  • Methods evaluated: Integrated Gradients, Attention Rollout, SHAP
  • Used fine-tuned DistilBERT model for SST-2 sentiment classification
  • Gradient-based attribution provided the most stable and intuitive explanations
  • Attention-based methods were computationally efficient but less aligned with prediction-relevant features (see the rollout sketch after this list)
  • Model-agnostic approaches offered flexibility with higher computational cost
  • Focus was on practical evaluation rather than proposing new methods
  • Research addresses transparency challenges for LLM trust and deployment
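
As referenced in the list above, the attention-based approach can be sketched as attention rollout: head-averaged attention matrices are combined across layers to estimate how much each input token contributes to the [CLS] representation. Again, this is an illustrative sketch under assumed details (checkpoint name, example sentence), not the preprint's implementation.

    # Minimal attention-rollout sketch (Abnar & Zuidema, 2020) on the same
    # assumed DistilBERT SST-2 checkpoint.
    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    model_name = "distilbert-base-uncased-finetuned-sst-2-english"  # assumed checkpoint
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, output_attentions=True
    )
    model.eval()

    text = "The film is a quiet, devastating triumph."
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc)

    # Rollout: average heads, add the identity for residual connections,
    # row-normalize, then multiply the layer matrices from bottom to top.
    rollout = None
    for layer_attn in out.attentions:            # each: (batch, heads, seq, seq)
        attn = layer_attn.mean(dim=1)[0]         # average over heads -> (seq, seq)
        attn = attn + torch.eye(attn.size(0))    # account for residual connection
        attn = attn / attn.sum(dim=-1, keepdim=True)
        rollout = attn if rollout is None else attn @ rollout

    # The [CLS] row gives a per-token relevance estimate.
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"].squeeze(0))
    for tok, score in zip(tokens, rollout[0].tolist()):
        print(f"{tok:>12s}  {score:.4f}")

Because rollout needs only one forward pass and no gradients, it is cheap to compute, which matches the study's observation that attention-based explanations trade faithfulness to the prediction for computational efficiency.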
