ARTFEED — Contemporary Art Intelligence

Transcoders Reveal Visual Grounding in Vision-Language Models

ai-technology · 2026-05-25

A recent study presents Transcoders as a technique for interpreting how Vision-Language Models (VLMs) turn visual information into text. Unlike Sparse Autoencoders (SAEs), which decompose static representations, Transcoders act as a causal proxy for layer-wise computation by approximating MLP sublayers with sparse features. Applied to Gemma 3-4B-IT, the framework decomposes the model into interpretable pathways that link image patches to token generation directions. Under patch ablation, Transcoder attributions produce stronger and more consistent effects on visually grounded tokens than SAE attributions, and they align more closely with semantically relevant image regions. A counterfactual analysis of False Visual Grounding confirms the specificity of these pathways.
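To make the SAE versus Transcoder distinction concrete, here is a minimal sketch in plain Python. It is not the study's implementation: the class name SparseMap, the top-k sparsity rule, the toy dimensions, and the random untrained weights are all assumptions chosen for illustration. The only point it shows is the difference in training target: an SAE is scored on reconstructing the activation it reads, while a Transcoder is scored on predicting the MLP sublayer's output from the MLP's input.

```python
import random

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu(v):
    return [max(0.0, a) for a in v]

def top_k(v, k):
    """Keep the k largest entries and zero the rest (one possible sparsity rule)."""
    keep = set(sorted(range(len(v)), key=lambda i: v[i], reverse=True)[:k])
    return [a if i in keep else 0.0 for i, a in enumerate(v)]

def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0, 0.5) for _ in range(cols)] for _ in range(rows)]

def mse(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)

class SparseMap:
    """Hypothetical encoder -> sparse latent -> decoder module.

    Whether it is an SAE or a Transcoder depends only on the training target.
    """
    def __init__(self, d_in, d_latent, d_out, k, rng):
        self.W_enc = rand_matrix(d_latent, d_in, rng)
        self.W_dec = rand_matrix(d_out, d_latent, rng)
        self.k = k

    def encode(self, x):
        return top_k(relu(matvec(self.W_enc, x)), self.k)

    def decode(self, z):
        return matvec(self.W_dec, z)

def mlp(x, W1, W2):
    """Toy stand-in for a transformer MLP sublayer."""
    return matvec(W2, relu(matvec(W1, x)))

rng = random.Random(0)
d_model, d_hidden, d_latent = 4, 8, 16  # toy sizes, not Gemma's
W1 = rand_matrix(d_hidden, d_model, rng)
W2 = rand_matrix(d_model, d_hidden, rng)

x = [rng.gauss(0, 1) for _ in range(d_model)]  # MLP input activation
y = mlp(x, W1, W2)                             # MLP output activation

model = SparseMap(d_model, d_latent, d_model, k=3, rng=rng)
z = model.encode(x)
y_hat = model.decode(z)

sae_loss = mse(y_hat, x)         # SAE target: the same activation that was read
transcoder_loss = mse(y_hat, y)  # Transcoder target: the MLP's output
active = sum(1 for a in z if a > 0)
```

Because the weights here are untrained, the two loss values carry no meaning beyond showing which target each objective compares against. In a trained Transcoder, each active latent would contribute a fixed decoder direction to the MLP output, which is what lets attributions be traced through the layer.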

Key facts

  • Transcoders are sparse approximations of MLP sublayers
  • They act as a causal proxy for layer-wise computation
  • Applied to Gemma 3-4B-IT
  • Framework decomposes model into interpretable pathways
  • Links image patches to token generation directions
  • Transcoder attributions show stronger, more consistent effects than SAE attributions under patch ablation
  • Better alignment with semantically relevant image regions
  • False Visual Grounding counterfactual analysis confirms pathway specificity
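
The patch ablation test mentioned in the summary can be sketched as follows. This is an illustrative toy, not the study's protocol: token_logit is a hypothetical linear scorer standing in for a VLM's logit on one visually grounded token, and replacing a patch with a zero baseline is an assumed ablation choice. The idea is that patches ranked highly by a good attribution method should cause the largest drop in the token's logit when removed.

```python
def token_logit(patches, weights):
    """Toy stand-in for a VLM's logit on a single visually grounded token."""
    return sum(w * sum(p) for w, p in zip(weights, patches))

def ablation_effects(patches, weights, baseline=0.0):
    """Return the logit drop caused by replacing each patch with a baseline."""
    full = token_logit(patches, weights)
    effects = []
    for i in range(len(patches)):
        ablated = [
            [baseline] * len(p) if j == i else p
            for j, p in enumerate(patches)
        ]
        effects.append(full - token_logit(ablated, weights))
    return effects

# Three made-up image patches with two features each.
patches = [[1.0, 2.0], [0.5, 0.5], [3.0, 1.0]]
# Hypothetical sensitivity of the token to each patch.
weights = [0.1, 0.0, 2.0]

effects = ablation_effects(patches, weights)
# Patch indices ordered from largest to smallest logit drop.
ranked = sorted(range(len(effects)), key=lambda i: effects[i], reverse=True)
```

An attribution method could then be compared against this ranking: the more its top-scored patches coincide with the patches whose removal hurts the token most, the more faithful it is under this test.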
