New Framework Traces LLM Outputs to Specific Training Tokens
Researchers have developed a method for attributing large language model predictions to specific tokens in the training data, addressing a critical reliability need in healthcare applications. The framework, described in arXiv:2605.12809, attaches a sparse autoencoder to any layer of a pretrained LLM to learn approximately independent latent features. Whereas prior influence functions assume token independence and are limited to autoregressive settings, this latent mediation approach computes influence through the learned latent features, allowing it to handle token interactions that are inherently non-decomposable. The work enables token-level precision, identifying which training examples, and which tokens within them, influence a given output, much as a medical case study grounds a conclusion in specific details. The method is flexible and applies to general prediction tasks.
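The paper's exact architecture is not specified here, but the core idea, training a sparse autoencoder on the activations of one LLM layer so that each latent dimension behaves as an approximately independent feature, can be sketched as follows. All dimensions, initialization, and the loss weighting below are illustrative assumptions, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_latent = 16, 64   # latent space is overcomplete relative to the layer width

# Stand-in for hidden activations captured from one layer of a pretrained LLM
# (batch of 8 token positions); in practice these come from a forward hook.
acts = rng.normal(size=(8, d_model))

# Sparse autoencoder parameters (randomly initialized for this sketch).
W_enc = rng.normal(scale=0.1, size=(d_model, d_latent))
b_enc = np.zeros(d_latent)
W_dec = rng.normal(scale=0.1, size=(d_latent, d_model))
b_dec = np.zeros(d_model)

def sae_forward(x):
    """Encode activations into sparse latent codes and reconstruct them."""
    z = np.maximum(x @ W_enc + b_enc, 0.0)   # ReLU yields nonnegative, sparsifiable codes
    x_hat = z @ W_dec + b_dec                # linear decoder reconstructs the layer
    return z, x_hat

z, x_hat = sae_forward(acts)

# Training objective: reconstruction error plus an L1 penalty that pushes
# most latent coordinates to zero, encouraging independent, interpretable features.
l1_coeff = 1e-3
loss = np.mean((acts - x_hat) ** 2) + l1_coeff * np.abs(z).mean()
print(z.shape, float(loss) > 0.0)
```

Once trained, the decoder's rows define feature directions in activation space, and influence can be attributed to latent features rather than to raw tokens.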
Key facts
- arXiv:2605.12809 introduces a framework for token-level influence attribution in LLMs.
- The method uses sparse autoencoders to learn approximately independent latent features.
- Prior influence functions are restricted to autoregressive settings and assume token independence.
- The new approach computes influence through latent features, handling token interactions that are not decomposable.
- The work targets reliable LLM use in healthcare.
- The framework can be attached to any layer of a pretrained LLM.
- It enables pinpointing which tokens in training data influence a decision.
- The approach is described as a latent mediation method.
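To make the influence-attribution step concrete, the classic influence-function formula, score(i) = -grad_test^T H^{-1} grad_i, can be demonstrated on a toy ridge-regression model. This is a generic sketch of influence functions, not the paper's latent mediation method; all names and numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy setup: ridge regression stands in for the model; the input features
# play the role the paper assigns to latent features (purely illustrative).
n, d, lam = 20, 5, 1e-2
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

# Closed-form ridge fit: (X^T X / n + lam I) w = X^T y / n.
H = X.T @ X / n + lam * np.eye(d)          # Hessian of the ridge objective
w = np.linalg.solve(H, X.T @ y / n)

def grad_point(x, t):
    """Gradient of the squared loss at a single example."""
    return (x @ w - t) * x

# Influence of training example i on the loss at a test point:
#   score(i) = -grad_test^T H^{-1} grad_i
x_test, y_test = rng.normal(size=d), 0.0
H_inv_g = np.linalg.solve(H, grad_point(x_test, y_test))
scores = np.array([-grad_point(X[i], y[i]) @ H_inv_g for i in range(n)])

# Ranking |score| surfaces the training examples most responsible for the prediction.
top = np.argsort(-np.abs(scores))[:3]
print(top, scores.shape)
```

A positive score means removing that training example would increase the test loss; ranking by magnitude identifies the most influential examples, which the paper refines down to the token level via latent features.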