Tool Choice in Language Models Is Linearly Readable and Steerable
A recent study published on arXiv indicates that tool selection in instruction-tuned language models can be read out of, and steered through, their internal activations. The investigation covered 12 models from the Gemma 3, Qwen 3, Qwen 2.5, and Llama 3.1 families, ranging from 270M to 27B parameters. The researchers found that adding the difference between two tools' mean activations switches the model's tool choice with 77-100% accuracy on name-only single-turn prompts, rising to 93-100% for models with 4B+ parameters, and the generated JSON arguments adapt to the new tool's schema. The same per-tool means also flag likely errors: on Gemma 3 12B and 27B, queries with the smallest margin between the top-1 and top-2 tools produce 14-21 times more incorrect calls than high-margin queries. The causal effect concentrates in a single direction at the output layer.
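The steering idea can be sketched in a few lines of NumPy. This is an illustrative toy, not the paper's code: the synthetic "activations" for tools A and B, the layer at which they are collected, and all variable names are assumptions. The steering vector is simply the difference of the two per-tool mean activations, and adding it to a tool-A activation moves it toward tool B's cluster.

```python
import numpy as np

# Toy illustration of mean-difference steering between two tools.
# h_A, h_B stand in for activations (n_prompts x d_model) collected at one
# layer while the model calls tool A and tool B; real activations would come
# from the model's hidden states, not a random generator.
rng = np.random.default_rng(0)
d_model = 8
h_A = rng.normal(loc=1.0, size=(50, d_model))   # synthetic tool-A activations
h_B = rng.normal(loc=-1.0, size=(50, d_model))  # synthetic tool-B activations

# Steering vector: difference of the per-tool mean activations.
steer = h_B.mean(axis=0) - h_A.mean(axis=0)

# Adding the vector to a tool-A activation moves it toward tool B's mean.
h = h_A[0]
h_steered = h + steer
dist_before = np.linalg.norm(h - h_B.mean(axis=0))
dist_after = np.linalg.norm(h_steered - h_B.mean(axis=0))
print(dist_after < dist_before)
```

In the study, an intervention of this form is applied to the model's hidden state during generation, which is what flips the emitted tool call and its JSON arguments.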
Key facts
- Tool identity is linearly readable from internal activations
- Steering by adding the mean-difference vector switches the chosen tool with 77-100% accuracy
- Accuracy reaches 93-100% for models 4B+ parameters
- JSON arguments adapt to new tool's schema after steering
- Small gap between top-1 and top-2 tools predicts 14-21x more errors
- Causal effect concentrates in a single direction at the output layer
- 12 models tested across Gemma 3, Qwen 3, Qwen 2.5, Llama 3.1
- Model sizes range from 270M to 27B parameters
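The error-flagging fact above can also be sketched. The following is a hedged toy, with made-up tool names, activation vectors, and a cosine-similarity scoring choice that are all assumptions: a query activation is compared against each per-tool mean, and the gap between the top-1 and top-2 similarities serves as the margin. Low-margin queries are the ones the study found 14-21x more likely to produce incorrect calls.

```python
import numpy as np

def margin(h, tool_means):
    """Top-1 minus top-2 cosine similarity of activation h to per-tool means."""
    sims = sorted(
        (np.dot(h, m) / (np.linalg.norm(h) * np.linalg.norm(m))
         for m in tool_means.values()),
        reverse=True,
    )
    return sims[0] - sims[1]

# Illustrative per-tool mean activations (hypothetical tools and values).
tool_means = {
    "search": np.array([1.0, 0.0, 0.2]),
    "calculator": np.array([0.0, 1.0, 0.1]),
}

h_clear = np.array([0.9, 0.1, 0.2])       # clearly closest to "search"
h_ambiguous = np.array([0.5, 0.5, 0.15])  # roughly between the two tools
print(margin(h_clear, tool_means) > margin(h_ambiguous, tool_means))
```

A deployment could threshold this margin to route ambiguous queries to a fallback, such as asking the user to confirm the intended tool.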
Entities
Institutions
- arXiv