Counterfactual Likelihood Test Measures Influence in Private Reasoning Channels
A recent technique described in arXiv:2605.19092 uses a counterfactual likelihood test to measure indirect influence between private reasoning channels in AI systems. The method replaces an upstream private block with a length-matched donor block while holding the public token sequence and the downstream target fixed, then measures the resulting shift in the downstream target's negative log-likelihood. In validation on a 7B role-channel reasoning model, textual probes prove unreliable: raw n-gram overlap overstates leakage, corrected overlap remains noisy, and canary reproduction shows no discrimination. The counterfactual likelihood test, by contrast, separates masked from unmasked conditions, and length matching controls for a RoPE positional confound.
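The swap-and-score procedure can be illustrated with a minimal Python sketch. It is not the paper's implementation: the names `counterfactual_nll_shift`, `target_nll`, and `toy_logprob_fn`, the split of the context into a public prefix and suffix around the private block, and the use of mean per-token NLL are all assumptions made for illustration. The toy scorer stands in for a real model and simply assigns higher probability to target tokens that already appear in the context.

```python
import math
from typing import Callable, Sequence

Token = int
# Hypothetical scorer interface: per-token log-probabilities of `target` given `context`.
LogProbFn = Callable[[Sequence[Token], Sequence[Token]], Sequence[float]]


def target_nll(logprob_fn: LogProbFn, context: Sequence[Token], target: Sequence[Token]) -> float:
    """Mean negative log-likelihood of the target tokens given the context."""
    logps = logprob_fn(context, target)
    return -sum(logps) / len(target)


def counterfactual_nll_shift(
    logprob_fn: LogProbFn,
    public_prefix: Sequence[Token],
    private_block: Sequence[Token],
    donor_block: Sequence[Token],
    public_suffix: Sequence[Token],
    target: Sequence[Token],
) -> float:
    """NLL(target | donor swapped in) minus NLL(target | original private block)."""
    # Length matching keeps every later token at the same position index,
    # so a change in score is not caused by shifted positions alone.
    if len(donor_block) != len(private_block):
        raise ValueError("donor block must match the private block length")
    original = [*public_prefix, *private_block, *public_suffix]
    swapped = [*public_prefix, *donor_block, *public_suffix]
    return target_nll(logprob_fn, swapped, target) - target_nll(logprob_fn, original, target)


def toy_logprob_fn(context: Sequence[Token], target: Sequence[Token]) -> list[float]:
    """Toy stand-in for a model (assumption): tokens seen in context are likelier."""
    seen = set(context)
    return [math.log(0.9 if tok in seen else 0.1) for tok in target]


# Public tokens and the target stay fixed; only the private block is swapped.
shift = counterfactual_nll_shift(
    toy_logprob_fn,
    public_prefix=[1, 2],
    private_block=[7, 8, 9],
    donor_block=[4, 5, 6],
    public_suffix=[3],
    target=[7, 3],
)
print(f"NLL shift: {shift:.4f}")
```

In this sketch a positive shift means the target became less likely once the private block was replaced, which is read as evidence that the block influenced the target. A shift near zero suggests no measurable influence under the scorer used.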
Key facts
- Method uses a counterfactual likelihood test to measure influence between private reasoning channels
- Replaces upstream private block with length-matched donor block
- Holds public token sequence and downstream target fixed
- Measures downstream target's negative-log-likelihood shift
- Validated on a 7B role-channel reasoning model
- Textual probes are unreliable: raw n-gram overlap overstates leakage
- Corrected overlap remains noisy
- Canary reproduction shows no discrimination
Entities
Institutions
- arXiv