Detecting OOD Text via SAE Layer Transitions in LLMs
A new arXiv preprint (2605.11920) proposes using sparse autoencoder (SAE) representations across layer transitions to detect out-of-domain (OOD) text in large language models (LLMs). The method treats the model's layer-to-layer internal dynamics as interpretable signals and applies lightweight learning methods to them to distinguish OOD from in-domain inputs. Benchmarked on the Gemma-2 2B and 9B models, the approach outperforms black-box detectors and offers insight into how LLMs process inputs internally.
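To make the idea concrete, here is a minimal sketch of what features over SAE layer transitions might look like. The SAE weights below are random placeholders (in practice they would come from SAEs trained on Gemma-2 activations), the dimensions are toy-sized, and the choice of transition statistics (cosine similarity and active-feature Jaccard overlap between consecutive layers' codes) is an illustrative assumption, not the paper's exact feature set.

```python
# Sketch: SAE transition features for one token's per-layer hidden states.
# All weights and sizes below are placeholders, not Gemma-2's real values.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae, n_layers = 256, 1024, 4  # toy sizes

# Placeholder per-layer SAE encoders: code = ReLU(h @ W_enc + b_enc)
W_enc = [rng.normal(0, 0.02, (d_model, d_sae)) for _ in range(n_layers)]
b_enc = [np.zeros(d_sae) for _ in range(n_layers)]

def sae_code(h, layer):
    """Sparse feature activations for one layer's hidden state."""
    return np.maximum(h @ W_enc[layer] + b_enc[layer], 0.0)

def transition_features(hidden_states):
    """Summarize how SAE codes change between consecutive layers.

    Per transition: cosine similarity of the codes and Jaccard overlap
    of the active-feature sets (one plausible choice of statistics).
    """
    feats = []
    codes = [sae_code(h, l) for l, h in enumerate(hidden_states)]
    for a, b in zip(codes[:-1], codes[1:]):
        cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
        active_a, active_b = a > 0, b > 0
        jac = (active_a & active_b).sum() / max((active_a | active_b).sum(), 1)
        feats.append([cos, jac])
    return np.concatenate(feats)

# Toy usage: random stand-ins for one token's per-layer hidden states.
hidden = [rng.normal(size=d_model) for _ in range(n_layers)]
print(transition_features(hidden))  # 2 stats per layer transition
```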
Key facts
- arXiv paper 2605.11920
- Uses sparse autoencoder (SAE) representations across layer transitions
- Detects out-of-domain (OOD) text
- Benchmarked on Gemma-2 2B and 9B models
- Lightweight learning methods for domain-specific signatures (see the detector sketch after this list)
- Improves interpretability of LLM decisions
- Addresses domain-specific application challenges
- Treats the LLM as interpretable rather than a black box
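The preprint's exact lightweight detectors are not spelled out here, so the following is one plausible stand-in: fit a Gaussian to in-domain transition features (such as those produced by the sketch above) and score new inputs by Mahalanobis distance, flagging high-distance inputs as OOD. All names and data below are illustrative.

```python
# Sketch: a lightweight OOD detector over transition feature vectors.
# This is a generic Mahalanobis-distance baseline, not necessarily the
# paper's exact learning method; the data here is synthetic.
import numpy as np

def fit_gaussian(X):
    """Mean and regularized inverse covariance of in-domain features."""
    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False) + 1e-6 * np.eye(X.shape[1])
    return mu, np.linalg.inv(cov)

def ood_score(x, mu, cov_inv):
    """Squared Mahalanobis distance; higher means more OOD."""
    d = x - mu
    return float(d @ cov_inv @ d)

rng = np.random.default_rng(1)
in_domain = rng.normal(0.0, 1.0, (200, 6))  # e.g., 2 stats x 3 transitions
ood_input = rng.normal(4.0, 1.0, 6)         # synthetic outlier

mu, cov_inv = fit_gaussian(in_domain)
print(ood_score(in_domain[0], mu, cov_inv))  # small: in-domain
print(ood_score(ood_input, mu, cov_inv))     # large: flagged as OOD
```

When labeled OOD examples are available, a supervised alternative such as logistic regression over the same transition features would fit the same pipeline.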