Counterfactual Likelihood Test Measures Influence in Private Reasoning Channels

ai-technology · 2026-05-20

A technique described in arXiv:2605.19092 uses a counterfactual likelihood test to measure indirect influence between private reasoning channels in AI systems. The test replaces an upstream private block with a length-matched donor block, holds the public token sequence and the downstream target fixed, and measures the resulting shift in the target's negative log-likelihood. Experiments on a 7B role-channel reasoning model indicate that textual probes are unreliable: raw n-gram overlap overstates leakage, corrected overlap remains noisy, and canary reproduction shows no discrimination. The counterfactual likelihood test, by contrast, separates masked from unmasked conditions, and length matching controls for a RoPE positional confound.
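The swap-and-score procedure can be illustrated with a minimal Python sketch. It is not the paper's implementation: the `Scorer` interface, the context layout (private block, then public tokens, then target), the per-token averaging, and the toy scorer are all assumptions made for illustration.

```python
import math
from typing import Callable, Sequence

# Hypothetical interface: given context tokens and target tokens, return the
# log-probability of each target token. A real model call would go here.
Scorer = Callable[[Sequence[int], Sequence[int]], Sequence[float]]


def target_nll(scorer: Scorer, context: Sequence[int], target: Sequence[int]) -> float:
    """Mean negative log-likelihood of the target given the context."""
    logps = scorer(context, target)
    return -sum(logps) / len(logps)


def counterfactual_nll_shift(scorer, private, donor, public, target) -> float:
    """NLL(target | donor, public) minus NLL(target | private, public)."""
    # Length matching keeps the public tokens and target at the same positions.
    if len(donor) != len(private):
        raise ValueError("donor block must match the private block's length")
    nll_original = target_nll(scorer, list(private) + list(public), target)
    nll_swapped = target_nll(scorer, list(donor) + list(public), target)
    return nll_swapped - nll_original


def toy_scorer(visible_from: int) -> Scorer:
    """Toy stand-in for a model (assumption): a target token seen in the
    visible part of the context gets p=0.5, any other token gets p=0.1."""
    def score(context, target):
        seen = set(context[visible_from:])
        return [math.log(0.5 if tok in seen else 0.1) for tok in target]
    return score


private, donor = [1, 2, 3], [7, 8, 9]   # upstream block and its replacement
public, target = [4, 5], [2, 4]         # held fixed across both runs

# Unmasked: the scorer can see the private block, so swapping it matters.
shift_unmasked = counterfactual_nll_shift(toy_scorer(0), private, donor, public, target)
# Masked: the scorer only sees tokens after the private block.
shift_masked = counterfactual_nll_shift(toy_scorer(len(private)), private, donor, public, target)

print(f"unmasked shift: {shift_unmasked:.4f}")
print(f"masked shift:   {shift_masked:.4f}")
```

In this toy setup a positive shift means the target became less likely once the private block was swapped out, which is the signal of influence; the masked scorer never sees the block, so its shift is zero.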

Key facts

Method uses counterfactual likelihood test for measuring influence between private reasoning channels
Replaces upstream private block with length-matched donor block
Holds public token sequence and downstream target fixed
Measures downstream target's negative-log-likelihood shift
Validated on a 7B role-channel reasoning model
Textual probes are unreliable: raw n-gram overlap overstates leakage
Corrected overlap remains noisy
Canary reproduction reports no discrimination
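For contrast, a raw n-gram overlap probe can be sketched as below. This is a generic definition assumed for illustration; the paper's exact metric and its correction are not given here. The example shows one plausible way such a score can be nonzero without any leakage: the two texts simply share ordinary phrasing.

```python
from typing import Sequence, Set, Tuple


def ngrams(tokens: Sequence[str], n: int) -> Set[Tuple[str, ...]]:
    """Set of distinct n-grams in a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def raw_overlap(private: Sequence[str], output: Sequence[str], n: int = 2) -> float:
    """Fraction of the output's distinct n-grams that also occur in the private block."""
    out = ngrams(output, n)
    if not out:
        return 0.0
    return len(out & ngrams(private, n)) / len(out)


# Hypothetical texts: they differ in the one informative token but share
# boilerplate phrasing, so the raw score is still high.
private_block = "the answer is 42 because x".split()
downstream_output = "the answer is 7".split()

overlap = raw_overlap(private_block, downstream_output)
print(f"raw bigram overlap: {overlap:.3f}")
```

Here two of the output's three bigrams appear in the private block even though the informative token differs, so a surface-text score alone cannot tell shared phrasing apart from actual influence.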
