Transcoders Reveal Visual Grounding in Vision-Language Models

ai-technology · 2026-05-25

A recent study introduces Transcoders as a tool for understanding how Vision-Language Models (VLMs) turn visual information into text. Unlike Sparse Autoencoders (SAEs), which focus on static representations, Transcoders are sparse approximations of MLP sublayers and so serve as a causal proxy for layer-wise computation. Applied to Gemma 3-4B-IT, the framework decomposes the model into interpretable pathways that link image patches to token generation directions. Under patch ablation, Transcoder attributions produce stronger and more consistent effects on visually grounded tokens than SAE attributions, and they align more closely with semantically relevant image regions. A counterfactual analysis of False Visual Grounding confirms the specificity of these pathways.
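The difference between the two objectives can be shown with a toy sketch. The study's actual architecture and training recipe are not given here, so everything below is an assumption made for illustration: the ReLU encoder, the L1 sparsity penalty, the toy dimensions, and all names (`mlp`, `sparse_layer`, `sae_loss`, `transcoder_loss`). The only point is the reconstruction target: an SAE rebuilds the activation it reads, while a transcoder predicts the MLP sublayer's output from that sublayer's input.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden, d_latent = 8, 32, 64  # toy sizes, not Gemma's

# Toy MLP sublayer standing in for one layer of the real model.
W_in = rng.normal(size=(d_model, d_hidden)) / np.sqrt(d_model)
W_out = rng.normal(size=(d_hidden, d_model)) / np.sqrt(d_hidden)

def mlp(x):
    return np.maximum(x @ W_in, 0.0) @ W_out

# One sparse dictionary layer (hypothetical, untrained weights).
W_enc = rng.normal(size=(d_model, d_latent)) / np.sqrt(d_model)
b_enc = np.zeros(d_latent)
W_dec = rng.normal(size=(d_latent, d_model)) / np.sqrt(d_latent)

def sparse_layer(x):
    z = np.maximum(x @ W_enc + b_enc, 0.0)  # non-negative latent features
    return z, z @ W_dec

def sparsity(z):
    # Mean L1 norm of the latents (assumed penalty, not from the study).
    return np.abs(z).sum(axis=-1).mean()

def sae_loss(x, l1=1e-3):
    # SAE target: the same activation that was encoded.
    z, x_hat = sparse_layer(x)
    return float(np.mean((x_hat - x) ** 2) + l1 * sparsity(z))

def transcoder_loss(x, l1=1e-3):
    # Transcoder target: the MLP sublayer's output for this input.
    z, y_hat = sparse_layer(x)
    return float(np.mean((y_hat - mlp(x)) ** 2) + l1 * sparsity(z))

x = rng.normal(size=(16, d_model))
print(sae_loss(x), transcoder_loss(x))
```

Because the transcoder's latents sit between the MLP's input and output, editing a latent changes what the layer writes downstream, which is one way to read the "causal proxy" description.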

Key facts

Transcoders are sparse approximations of MLP sublayers
They act as a causal proxy for layer-wise computation
Applied to Gemma 3-4B-IT
Framework decomposes model into interpretable pathways
Links image patches to token generation directions
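Under patch ablation, Transcoder attributions have stronger, more consistent effects on visually grounded tokens than SAE attributions
Transcoder attributions also align better with semantically relevant image regions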
False Visual Grounding counterfactual analysis confirms pathway specificity
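
The patch-ablation comparison mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's evaluation protocol: the model is replaced by a hypothetical linear readout (`toy_logit`), ablation means zeroing a patch embedding, and both score vectors are made up. The idea is that a better attribution method ranks first the patches whose removal most reduces the logit of a visually grounded token.

```python
import numpy as np

def ablation_effect(logit_fn, patches, scores, k):
    """Drop in the target logit after zeroing the k highest-scored patches."""
    top = np.argsort(scores)[::-1][:k]  # indices of the k largest scores
    ablated = patches.copy()
    ablated[top] = 0.0                  # zero-ablation (an assumed baseline)
    return logit_fn(patches) - logit_fn(ablated)

# Hypothetical stand-in for a VLM: the target-token logit is a linear
# readout summed over patch embeddings.
readout = np.array([1.0, 0.0])

def toy_logit(patches):
    return float((patches @ readout).sum())

patches = np.array([[3.0, 1.0], [0.5, 2.0], [0.0, 4.0], [2.0, 0.0]])

# Two made-up attribution methods: one tracks each patch's true
# contribution, the other ranks the least relevant patches first.
faithful_scores = patches @ readout
unfaithful_scores = np.array([0.0, 1.0, 2.0, 0.5])

effect_faithful = ablation_effect(toy_logit, patches, faithful_scores, k=2)
effect_unfaithful = ablation_effect(toy_logit, patches, unfaithful_scores, k=2)
print(effect_faithful, effect_unfaithful)
```

In this toy setup the faithful scores remove far more of the logit for the same number of ablated patches, which is the kind of gap the comparison between Transcoder and SAE attributions is measuring.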

Sources

arXiv cs.AI — 2026-05-25