Translation Tax Not a Scalar: Counterfactual Audit of Chinese Benchmarks

publication · 2026-05-11

A new study challenges the assumption that the Translation Tax in multilingual benchmarks is a single scalar quantity. The researchers audited English-to-Chinese translations with three proxy estimators: back-translation gaps, cue-score calibration, and a six-model native-control comparison. The native-control comparison showed model-family effects rather than a uniform benchmark effect. A same-item LLM-naturalization stress test revealed a residue dose-response: high-residue items benefited from translation while low-residue items did not. The study concludes that the Translation Tax is not a single effect but a set of estimator- and item-dependent validity risks. The authors released per-cell evidence, the naturalization protocol, and human quality control (QC) data.

Key facts

The Translation Tax is often treated as a scalar in translated benchmarks.
Three proxy estimators were used: back-translation gaps, cue-score calibration, and a six-model native-control comparison.
Back-translation gaps were small and parser-fragile.
Cue-score calibration did not predict item-level gains.
The six-model native-control comparison showed model-family effects rather than a uniform benchmark effect.
A same-item LLM-naturalization stress test held answer, options, and content fixed while rewriting Chinese surface form.
After a prompt-construction bug was corrected, the stress-test contrast no longer supported a model-family interaction.
The stress test showed a residue dose-response: high-residue items benefited from translation while low-residue items did not.
The Translation Tax is therefore described as a set of estimator- and item-dependent validity risks rather than a single effect.
The study released per-cell evidence, the naturalization protocol, and human QC data.
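
As a rough illustration of what a residue dose-response check could look like, the sketch below bins items by a residue score and compares the mean same-item accuracy change in each bin. It is not the authors' protocol: the field names (`residue`, `correct_original`, `correct_rewritten`), the 0.5 threshold, and the toy records are all assumptions made up for this example.

```python
from collections import defaultdict
from statistics import mean


def gain_by_residue_bin(items, threshold=0.5):
    """Mean same-item accuracy change (rewritten minus original) per residue bin.

    `threshold` is a hypothetical cut-off separating "high" from "low" residue.
    """
    gains = defaultdict(list)
    for item in items:
        bin_name = "high" if item["residue"] >= threshold else "low"
        # Each item is scored twice: once in its original form, once after the rewrite.
        gains[bin_name].append(item["correct_rewritten"] - item["correct_original"])
    return {name: mean(values) for name, values in gains.items()}


# Toy records invented purely for illustration; they are not data from the study.
toy_items = [
    {"residue": 0.9, "correct_original": 0, "correct_rewritten": 1},
    {"residue": 0.8, "correct_original": 1, "correct_rewritten": 1},
    {"residue": 0.2, "correct_original": 1, "correct_rewritten": 1},
    {"residue": 0.1, "correct_original": 0, "correct_rewritten": 0},
]

result = gain_by_residue_bin(toy_items)
print(result)
```

A dose-response pattern would show up as a larger mean change in the high-residue bin than in the low-residue bin. A real analysis would also need per-model breakdowns and uncertainty estimates, which this sketch omits.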
