Perplexity Gap Reveals LLM Finetuning Objectives
A new method uses perplexity differencing to identify the finetuning objectives of large language models, including harmful ones. Researchers generated diverse completions from finetuned models using short random prefills, then ranked the completions by decreasing perplexity gap: how much more likely each completion is under the finetuned model than under its reference model. Top-ranked completions often reveal the finetuning goal without requiring access to model internals or prior assumptions about the behavior. The approach was evaluated on 76 model organisms ranging from 0.5 to 70 billion parameters.
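To make the ranking concrete, here is a minimal sketch of perplexity-gap scoring, assuming HuggingFace transformers causal LMs; the model names and the exact gap definition (perplexity under the reference model minus perplexity under the finetuned model) are illustrative assumptions, not the paper's verbatim setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

REF_NAME = "reference-org/base-model"  # hypothetical reference model
FT_NAME = "your-org/finetuned-model"   # hypothetical finetuned model

# Finetunes typically share the reference model's tokenizer.
tokenizer = AutoTokenizer.from_pretrained(REF_NAME)
ref_model = AutoModelForCausalLM.from_pretrained(REF_NAME)
ft_model = AutoModelForCausalLM.from_pretrained(FT_NAME)

@torch.no_grad()
def perplexity(model, text: str) -> float:
    """Token-level perplexity of `text` under `model`."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    # Causal-LM loss is the mean negative log-likelihood per token.
    loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

def rank_by_perplexity_gap(completions: list[str]) -> list[tuple[float, str]]:
    """Sort completions by decreasing gap ppl(reference) - ppl(finetuned).
    A large positive gap flags text that the finetuned model finds far more
    likely than the reference does, surfacing the finetuning objective."""
    scored = [(perplexity(ref_model, c) - perplexity(ft_model, c), c)
              for c in completions]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```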
Key facts
- Finetuning can introduce harmful behaviors in LLMs.
- Model organisms are models finetuned for specific known behaviors.
- Perplexity-based method surfaces finetuning objectives.
- Method uses short random prefills drawn from general corpora (see the sketch after this list).
- Completions are ranked by decreasing perplexity gap between the reference and finetuned models.
- No model internals or prior assumptions needed.
- Evaluated on 76 model organisms.
- Model sizes range from 0.5 to 70 billion parameters.
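Continuing the sketch above (reusing tokenizer, ft_model, and rank_by_perplexity_gap), the prefill-and-generate step could look like the following; the corpus, prefill length, and sampling settings are illustrative assumptions.

```python
import random
from datasets import load_dataset

def random_prefills(n: int, num_tokens: int = 5) -> list[str]:
    """Draw short random prefixes from a general-purpose corpus."""
    corpus = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
    texts = [t for t in corpus["text"] if t.strip()]
    return [tokenizer.decode(tokenizer(doc).input_ids[:num_tokens])
            for doc in random.sample(texts, n)]

@torch.no_grad()
def generate_completions(prefills: list[str], max_new_tokens: int = 64) -> list[str]:
    """Sample one completion per prefill from the finetuned model."""
    completions = []
    for p in prefills:
        ids = tokenizer(p, return_tensors="pt").input_ids
        out = ft_model.generate(ids, do_sample=True, max_new_tokens=max_new_tokens)
        completions.append(tokenizer.decode(out[0], skip_special_tokens=True))
    return completions

# End to end: generate from random prefills, then rank by perplexity gap.
ranked = rank_by_perplexity_gap(generate_completions(random_prefills(100)))
print(ranked[0])  # highest-gap completion, the leading candidate objective
```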