ARTFEED — Contemporary Art Intelligence

Political Censorship Routing in Chinese LLMs Exposed

ai-technology · 2026-05-04

A study on arXiv (2603.18280) finds that current alignment evaluation methods for language models fall short because they measure concept detection and refusal rates while missing the routing layer that connects detection to behavioral policy. Using political censorship in Chinese-origin models as a natural experiment, the researchers tested nine open-weight models from five labs with probes, surgical ablations, and behavioral tests. Probe accuracy alone proved non-diagnostic: political probes, null controls, and permutation baselines all reach 100%, so generalization to held-out topic categories is the informative test. Surgical ablation uncovered lab-specific routing: removing the political-sensitivity direction eliminated censorship and restored accurate factual output in most models, while one model confabulated because its factual knowledge and censorship mechanisms are entangled.
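
To see why in-distribution probe accuracy is non-diagnostic, consider a minimal sketch: when activation dimensions far outnumber samples, a linear probe separates even randomly permuted labels in-sample, so only transfer to a held-out topic category distinguishes a real sensitivity direction from memorization. Everything below is a synthetic assumption (dimensions, sample counts, the planted direction), not the paper's actual data or procedure.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_per, d = 100, 2048                       # samples per class << dimensions

# Planted "political-sensitivity" direction (synthetic assumption).
w_true = rng.standard_normal(d)
w_true /= np.linalg.norm(w_true)

def make_category():
    """Synthetic activations for one topic category: sensitive examples
    are shifted along w_true; a small random offset stands in for
    category-specific structure."""
    offset = 0.25 * rng.standard_normal(d)
    X = rng.standard_normal((2 * n_per, d)) + offset
    y = np.array([0] * n_per + [1] * n_per)
    X[y == 1] += 4.0 * w_true
    return X, y

X_train, y_train = make_category()         # category seen in training
X_held, y_held = make_category()           # held-out category, never trained on

probe = LogisticRegression(max_iter=5000).fit(X_train, y_train)

y_perm = rng.permutation(y_train)          # permutation baseline
null_probe = LogisticRegression(max_iter=5000).fit(X_train, y_perm)

# With d >> n the probe separates almost any labeling in-sample, so both
# scores are near-perfect and carry no information about the labels.
print("in-sample, real labels:     ", probe.score(X_train, y_train))
print("in-sample, permuted labels: ", null_probe.score(X_train, y_perm))

# Only a probe tracking a genuine direction transfers to an unseen
# category; the permutation baseline falls to roughly chance.
print("held-out category, real:    ", probe.score(X_held, y_held))
print("held-out category, permuted:", null_probe.score(X_held, y_held))
```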

Key facts

  • arXiv paper 2603.18280 published March 2026
  • Study examines political censorship in Chinese-origin language models
  • Nine open-weight models from five labs tested
  • Probe accuracy alone is non-diagnostic (all baselines reach 100%)
  • Surgical ablation reveals lab-specific routing mechanisms
  • Removing the political-sensitivity direction eliminates censorship in most models (see the ablation sketch after this list)
  • One model confabulates because its factual knowledge and censorship mechanisms are entangled
  • Held-out category generalization is the informative test
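
The ablation finding refers to directional ablation: estimating a single sensitivity direction in activation space and projecting it out of the model's activations. A minimal sketch on synthetic vectors follows; the function names, shapes, and the difference-in-means estimator are illustrative assumptions, since the feed does not reproduce the paper's exact procedure.

```python
import numpy as np

def difference_in_means_direction(acts_sensitive, acts_neutral):
    """Candidate sensitivity direction: mean activation on politically
    sensitive prompts minus mean activation on matched neutral prompts."""
    v = acts_sensitive.mean(axis=0) - acts_neutral.mean(axis=0)
    return v / np.linalg.norm(v)

def ablate_direction(activations, v_hat):
    """Project the unit direction v_hat out of each activation vector:
    x <- x - (x . v_hat) v_hat. Applied at every layer of a model, this
    removes its ability to represent information along that direction."""
    coeffs = activations @ v_hat                   # (n,) projections
    return activations - np.outer(coeffs, v_hat)   # (n, d) residuals

# Toy usage: plant a direction, then verify ablation zeroes it out.
rng = np.random.default_rng(1)
d = 1024
v_true = rng.standard_normal(d)
v_true /= np.linalg.norm(v_true)
sens = rng.standard_normal((50, d)) + 3.0 * v_true
neut = rng.standard_normal((50, d))

v_hat = difference_in_means_direction(sens, neut)
ablated = ablate_direction(sens, v_hat)
print("mean |projection| before:", np.abs(sens @ v_hat).mean())     # large
print("mean |projection| after: ", np.abs(ablated @ v_hat).mean())  # ~0
```

In the study's framing, whether this intervention cleanly removes censorship (most models) or also degrades factual recall (the confabulating model) is what exposes each lab's routing between detection and behavior.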

Entities

Institutions

  • arXiv

Sources

  • arXiv:2603.18280 (https://arxiv.org/abs/2603.18280)