Surrogate modeling framework interprets black-box LLMs in medical predictions
Researchers propose a surrogate modeling framework to interpret how large language models (LLMs) encode knowledge, addressing their black-box nature. The framework approximates the latent knowledge space of an LLM from observable input-output pairs, gathered by prompting the model extensively across simulated scenarios. Proof-of-concept experiments in medical prediction reveal the extent to which the LLM weighs each input variable in forming its output, including where that learned relationship may perpetuate inaccuracies. The study is published on arXiv (2604.20331).
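The sketch below illustrates the general idea under stated assumptions, not the paper's actual implementation: a hypothetical `query_llm` prompting step and illustrative patient variables stand in for the real setup. The black-box model is probed across simulated scenarios, and an interpretable surrogate (here, a logistic regression) is fit to the resulting input-output pairs; its coefficients suggest how strongly the LLM appears to weigh each input variable.

```python
# Minimal sketch of surrogate modeling for a black-box LLM predictor.
# `query_llm`, the feature set, and the synthetic decision rule are all
# illustrative assumptions, not the method or data from the paper.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

FEATURES = ["age", "systolic_bp", "smoker", "bmi"]  # assumed example inputs


def simulate_scenarios(n: int) -> np.ndarray:
    """Generate simulated patient profiles to prompt the LLM with."""
    age = rng.integers(30, 90, n)
    sbp = rng.integers(100, 190, n)
    smoker = rng.integers(0, 2, n)
    bmi = rng.uniform(18, 40, n)
    return np.column_stack([age, sbp, smoker, bmi]).astype(float)


def query_llm(profile: np.ndarray) -> int:
    """Placeholder for prompting the LLM with one scenario and parsing its
    binary risk prediction; replaced by a synthetic rule so the sketch
    runs end to end."""
    age, sbp, smoker, bmi = profile
    score = 0.03 * age + 0.02 * sbp + 0.8 * smoker + 0.05 * bmi
    return int(score > 6.5)


# Probe the black box across many simulated scenarios.
X = simulate_scenarios(2000)
y = np.array([query_llm(x) for x in X])

# Fit an interpretable surrogate to the observed input-output pairs.
surrogate = LogisticRegression(max_iter=1000).fit(X, y)

# Coefficients indicate how the LLM appears to weigh each input variable.
for name, coef in zip(FEATURES, surrogate.coef_[0]):
    print(f"{name:12s} {coef:+.3f}")
```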
Key facts
- arXiv paper 2604.20331 proposes surrogate modeling for LLM interpretability.
- The framework fits simplified, interpretable surrogate models to approximate the behavior of the complex LLM system.
- Experiments focus on medical predictions as proof of concept.
- The method reveals how LLMs perceive each input variable in relation to the output.
- Addresses concerns about LLMs perpetuating inaccuracies.
Entities
Institutions
- arXiv