ARTFEED — Contemporary Art Intelligence

Political Censorship Routing in Chinese LLMs Exposed

ai-technology · 2026-05-04

A study on arXiv (2603.18280) finds that current alignment evaluation methods for language models fall short because they measure concept detection and refusal rates while missing the routing layer that connects detection to behavioral policy. Using political censorship in Chinese-origin models as a natural experiment, the researchers tested nine open-weight models from five labs with probes, surgical ablations, and behavioral tests. Probe accuracy alone proved non-diagnostic: political probes, null controls, and permutation baselines all reach 100%, so generalization to held-out topic categories is the informative test. Surgical ablation uncovered lab-specific routing: removing the political-sensitivity direction eliminated censorship and restored accurate factual output in most models, while one model confabulated because its factual knowledge and censorship mechanisms are entangled.
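
To see why in-distribution probe accuracy is non-diagnostic, consider a minimal sketch: when activation dimensions far outnumber samples, a linear probe separates even randomly permuted labels in-sample, so only transfer to a held-out topic category distinguishes a real sensitivity direction from memorization. Everything below is a synthetic assumption (dimensions, sample counts, the planted direction), not the paper's actual data or procedure.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_per, d = 100, 2048                       # samples per class << dimensions

# Planted "political-sensitivity" direction (synthetic assumption).
w_true = rng.standard_normal(d)
w_true /= np.linalg.norm(w_true)

def make_category():
    """Synthetic activations for one topic category: sensitive examples
    are shifted along w_true; a small random offset stands in for
    category-specific structure."""
    offset = 0.25 * rng.standard_normal(d)
    X = rng.standard_normal((2 * n_per, d)) + offset
    y = np.array([0] * n_per + [1] * n_per)
    X[y == 1] += 4.0 * w_true
    return X, y

X_train, y_train = make_category()         # category seen in training
X_held, y_held = make_category()           # held-out category, never trained on

probe = LogisticRegression(max_iter=5000).fit(X_train, y_train)

y_perm = rng.permutation(y_train)          # permutation baseline
null_probe = LogisticRegression(max_iter=5000).fit(X_train, y_perm)

# With d >> n the probe separates almost any labeling in-sample, so both
# scores are near-perfect and carry no information about the labels.
print("in-sample, real labels:     ", probe.score(X_train, y_train))
print("in-sample, permuted labels: ", null_probe.score(X_train, y_perm))

# Only a probe tracking a genuine direction transfers to an unseen
# category; the permutation baseline falls to roughly chance.
print("held-out category, real:    ", probe.score(X_held, y_held))
print("held-out category, permuted:", null_probe.score(X_held, y_held))
```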

Key facts

  • arXiv paper 2603.18280 published March 2026
  • Study examines political censorship in Chinese-origin language models
  • Nine open-weight models from five labs tested
  • Probe accuracy alone is non-diagnostic (all baselines reach 100%)
  • Surgical ablation reveals lab-specific routing mechanisms
  • Removing the political-sensitivity direction eliminates censorship in most models (see the ablation sketch after this list)
  • One model confabulates because its factual knowledge and censorship mechanisms are entangled
  • Held-out category generalization is the informative test
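
The ablation finding refers to directional ablation: estimating a single sensitivity direction in activation space and projecting it out of the model's activations. A minimal sketch on synthetic vectors follows; the function names, shapes, and the difference-in-means estimator are illustrative assumptions, since the feed does not reproduce the paper's exact procedure.

```python
import numpy as np

def difference_in_means_direction(acts_sensitive, acts_neutral):
    """Candidate sensitivity direction: mean activation on politically
    sensitive prompts minus mean activation on matched neutral prompts."""
    v = acts_sensitive.mean(axis=0) - acts_neutral.mean(axis=0)
    return v / np.linalg.norm(v)

def ablate_direction(activations, v_hat):
    """Project the unit direction v_hat out of each activation vector:
    x <- x - (x . v_hat) v_hat. Applied at every layer of a model, this
    removes its ability to represent information along that direction."""
    coeffs = activations @ v_hat                   # (n,) projections
    return activations - np.outer(coeffs, v_hat)   # (n, d) residuals

# Toy usage: plant a direction, then verify ablation zeroes it out.
rng = np.random.default_rng(1)
d = 1024
v_true = rng.standard_normal(d)
v_true /= np.linalg.norm(v_true)
sens = rng.standard_normal((50, d)) + 3.0 * v_true
neut = rng.standard_normal((50, d))

v_hat = difference_in_means_direction(sens, neut)
ablated = ablate_direction(sens, v_hat)
print("mean |projection| before:", np.abs(sens @ v_hat).mean())     # large
print("mean |projection| after: ", np.abs(ablated @ v_hat).mean())  # ~0
```

In the study's framing, whether this intervention cleanly removes censorship (most models) or also degrades factual recall (the confabulating model) is what exposes each lab's routing between detection and behavior.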

Entities

Institutions

  • arXiv

Sources

  • arXiv:2603.18280 (https://arxiv.org/abs/2603.18280)