Tool Choice in Language Models Is Linearly Readable and Steerable
A recent study published on arXiv indicates that tool selection in instruction-tuned language models can be read out of, and steered through, their internal activations. The investigation covered 12 models from the Gemma 3, Qwen 3, Qwen 2.5, and Llama 3.1 families, ranging from 270M to 27B parameters. The researchers found that adding the difference between two tools' mean activations switches the model's tool choice with 77-100% accuracy on name-only single-turn prompts, rising to 93-100% for models with 4B+ parameters, and the generated JSON arguments adapt to the new tool's schema. The same per-tool means also flag likely errors: on Gemma 3 12B and 27B, queries with the smallest margin between the top-1 and top-2 tools produce 14-21 times more incorrect calls than high-margin queries. The causal effect concentrates in a single direction at the output layer.
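The steering idea can be sketched in a few lines of NumPy. This is an illustrative toy, not the paper's code: the synthetic "activations" for tools A and B, the layer at which they are collected, and all variable names are assumptions. The steering vector is simply the difference of the two per-tool mean activations, and adding it to a tool-A activation moves it toward tool B's cluster.

```python
import numpy as np

# Toy illustration of mean-difference steering between two tools.
# h_A, h_B stand in for activations (n_prompts x d_model) collected at one
# layer while the model calls tool A and tool B; real activations would come
# from the model's hidden states, not a random generator.
rng = np.random.default_rng(0)
d_model = 8
h_A = rng.normal(loc=1.0, size=(50, d_model))   # synthetic tool-A activations
h_B = rng.normal(loc=-1.0, size=(50, d_model))  # synthetic tool-B activations

# Steering vector: difference of the per-tool mean activations.
steer = h_B.mean(axis=0) - h_A.mean(axis=0)

# Adding the vector to a tool-A activation moves it toward tool B's mean.
h = h_A[0]
h_steered = h + steer
dist_before = np.linalg.norm(h - h_B.mean(axis=0))
dist_after = np.linalg.norm(h_steered - h_B.mean(axis=0))
print(dist_after < dist_before)
```

In the study, an intervention of this form is applied to the model's hidden state during generation, which is what flips the emitted tool call and its JSON arguments.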
Key facts
- Tool identity is linearly readable from internal activations
- Steering by adding the mean-difference vector switches the chosen tool with 77-100% accuracy
- Accuracy reaches 93-100% for models 4B+ parameters
- JSON arguments adapt to new tool's schema after steering
- Small gap between top-1 and top-2 tools predicts 14-21x more errors
- Causal effect concentrates in a single direction at the output layer
- 12 models tested across Gemma 3, Qwen 3, Qwen 2.5, Llama 3.1
- Model sizes range from 270M to 27B parameters
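The error-flagging fact above can also be sketched. The following is a hedged toy, with made-up tool names, activation vectors, and a cosine-similarity scoring choice that are all assumptions: a query activation is compared against each per-tool mean, and the gap between the top-1 and top-2 similarities serves as the margin. Low-margin queries are the ones the study found 14-21x more likely to produce incorrect calls.

```python
import numpy as np

def margin(h, tool_means):
    """Top-1 minus top-2 cosine similarity of activation h to per-tool means."""
    sims = sorted(
        (np.dot(h, m) / (np.linalg.norm(h) * np.linalg.norm(m))
         for m in tool_means.values()),
        reverse=True,
    )
    return sims[0] - sims[1]

# Illustrative per-tool mean activations (hypothetical tools and values).
tool_means = {
    "search": np.array([1.0, 0.0, 0.2]),
    "calculator": np.array([0.0, 1.0, 0.1]),
}

h_clear = np.array([0.9, 0.1, 0.2])       # clearly closest to "search"
h_ambiguous = np.array([0.5, 0.5, 0.15])  # roughly between the two tools
print(margin(h_clear, tool_means) > margin(h_ambiguous, tool_means))
```

A deployment could threshold this margin to route ambiguous queries to a fallback, such as asking the user to confirm the intended tool.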
Entities
Institutions
- arXiv