Mechanistic Interpretability Study Reveals Censorship Circuit in Qwen 3.5 LLM

ai-technology · 2026-05-19

A mechanistic interpretability study of the Qwen 3.5-9B large language model reports a small, identifiable circuit responsible for political censorship of topics sensitive to the People's Republic of China (PRC). According to the study, the censorship does not stem from missing factual knowledge: the base model (Qwen 3.5-9B-Base) gives accurate, Western-framed answers on topics such as Tiananmen Square, Falun Gong, and Taiwan. Post-training instead layers a behavioral filter on top of those facts.

The circuit has two halves. A writer band (layers 11–20) computes three internal directions: d_prc (whether the prompt is PRC-sensitive), d_refuse (whether to refuse), and d_style (whether to deflect or propagandize). A reader band (layers 20–31) renders that decision into text. The verdict commits in Chinese tokens around layer 24, even for English prompts, but this Chinese intermediate is behaviorally inert.

Steering these directions at the writer layers can flip the model's behavior: subtracting d_prc on a Tiananmen prompt yields a factual answer, while subtracting d_refuse jailbreaks harmful prompts. The filter is mostly PRC-specific but overgeneralizes to structurally similar non-PRC prompts (e.g., Kosovo, Catalonia, Saudi Arabia, and Arab Spring self-immolation). The study was conducted by an independent researcher and published on a personal blog.
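To make "subtracting a direction" concrete, here is a minimal sketch of one common form of activation steering: removing the component of each token's hidden state that lies along a chosen direction. It is not the study's code. The function name subtract_direction, the alpha strength parameter, and the random stand-in arrays are all hypothetical, and the study may have subtracted a fixed scaled vector rather than a projection.

```python
import numpy as np

def subtract_direction(hidden, direction, alpha=1.0):
    """Remove alpha times the component of each row of `hidden` along `direction`.

    hidden:    array of shape (seq_len, d_model), one row per token
    direction: array of shape (d_model,), e.g. a stand-in for d_prc
    alpha:     1.0 removes the component entirely; smaller values only dampen it
    """
    unit = direction / np.linalg.norm(direction)
    coeff = hidden @ unit                      # per-token projection, shape (seq_len,)
    return hidden - alpha * np.outer(coeff, unit)

# Toy data standing in for real activations (assumption: no real model is loaded here).
rng = np.random.default_rng(0)
d_prc = rng.normal(size=8)
hidden = rng.normal(size=(4, 8))

steered = subtract_direction(hidden, d_prc)

# After full removal, the steered states have no component left along d_prc.
residual = steered @ (d_prc / np.linalg.norm(d_prc))
```

In a real experiment the same operation would be applied to the residual stream at a writer-band layer during the forward pass, which requires hooking into the model and is outside the scope of this sketch.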

Key facts

Qwen 3.5-9B's political censorship is a small, identifiable circuit that can be found, read, and turned off.
The factual knowledge is already present in pretraining; censorship is behavior layered on top.
The circuit has two halves: writer band (layers 11–20) and reader band (layers 20–31).
Three internal directions are computed: d_prc (PRC-sensitive content), d_refuse (refusal), d_style (deflect vs. propaganda).
The verdict commits in Chinese tokens around layer 24, even for English prompts, but this Chinese intermediate is behaviorally inert.
Steering at the writer layers can flip behavior: subtracting d_prc yields factual answers on PRC topics, and subtracting d_refuse jailbreaks harmful prompts.
The filter is mostly PRC-specific but overgeneralizes to structurally similar prompts about Kosovo, Catalonia, Saudi Arabia, and Arab Spring self-immolation.
The model is small enough to run on a consumer RTX GPU, enabling cheap experiments.
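
The summary does not say how the three directions were extracted. One widely used approach in interpretability work, shown here purely as an assumption, is the difference of mean activations between two prompt sets (for example, PRC-sensitive versus neutral prompts) at a single layer. The function name mean_difference_direction and the synthetic activations below are hypothetical.

```python
import numpy as np

def mean_difference_direction(acts_pos, acts_neg):
    """Unit vector pointing from the mean of acts_neg to the mean of acts_pos.

    Both inputs have shape (n_prompts, d_model): one activation vector per prompt,
    taken at the same layer and token position.
    """
    diff = acts_pos.mean(axis=0) - acts_neg.mean(axis=0)
    return diff / np.linalg.norm(diff)

# Synthetic stand-in data: "sensitive" activations are shifted along one known axis.
rng = np.random.default_rng(1)
true_axis = np.zeros(16)
true_axis[3] = 1.0
acts_neutral = rng.normal(size=(200, 16))
acts_sensitive = rng.normal(size=(200, 16)) + 4.0 * true_axis

direction = mean_difference_direction(acts_sensitive, acts_neutral)

# Cosine similarity between the recovered direction and the planted axis.
cosine = float(direction @ true_axis)
```

On real activations there is no planted axis to compare against, so a recovered direction is usually validated by steering with it and checking whether the model's behavior changes.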

Entities

Institutions

Qwen
Hugging Face

Locations

China
Tiananmen Square
Taiwan
Xinjiang
Hong Kong
Tibet
Kosovo
Catalonia
Saudi Arabia

Topics and events

Falun Gong
Arab Spring

Sources

Hacker News AI — 2026-05-19