Translation Tax Not a Scalar: Counterfactual Audit of Chinese Benchmarks
A new study challenges the assumption that the Translation Tax in translated multilingual benchmarks can be summarized as a single scalar. The researchers audited English-to-Chinese benchmark translations with three proxy estimators: back-translation gaps, cue-score calibration, and a six-model native-control comparison. The native-control comparison showed model-family effects rather than a uniform benchmark effect. A same-item LLM-naturalization stress test revealed a residue dose-response: high-residue items benefited from naturalization, while low-residue items did not. The study concludes that the Translation Tax is not a single effect but a set of estimator- and item-dependent validity risks. The authors released per-cell evidence, the naturalization protocol, and human quality-control (QC) data.
Key facts
- The Translation Tax is often treated as a scalar in translated benchmarks.
- Three proxy estimators were used: back-translation gaps, cue-score calibration, and a six-model native-control comparison.
- Back-translation gaps were small and parser-fragile.
- Cue-score calibration did not predict item-level gains.
- The six-model native-control comparison showed model-family effects rather than a uniform benchmark effect.
- A same-item LLM-naturalization stress test held answer, options, and content fixed while rewriting Chinese surface form.
- After a prompt-construction bug was corrected, the stress-test contrast no longer supported a model-family interaction.
- High-residue items benefited from naturalization while low-residue items did not (a residue dose-response).
- The Translation Tax is therefore better described as a set of estimator- and item-dependent validity risks than as a single effect.
- The study released per-cell evidence, the naturalization protocol, and human QC data.
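
To make the residue dose-response idea concrete, here is a minimal sketch of how a per-item gain could be compared across residue levels. It is not the authors' protocol: the `ItemResult` fields, the `residue` score scale, the 0.5 threshold, and the toy data are all assumptions invented for illustration.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class ItemResult:
    residue: float             # hypothetical translation-residue score in [0, 1]
    correct_translated: bool   # model was correct on the translated item
    correct_naturalized: bool  # model was correct on the naturalized rewrite of the same item


def gain_by_residue_bin(results, threshold=0.5):
    """Mean accuracy change (naturalized minus translated) for low- and high-residue items.

    The threshold is an arbitrary illustrative cut, not a value from the study.
    """
    bins = {"low": [], "high": []}
    for r in results:
        key = "high" if r.residue >= threshold else "low"
        bins[key].append(int(r.correct_naturalized) - int(r.correct_translated))
    # An empty bin has no defined mean, so report None for it.
    return {k: (mean(v) if v else None) for k, v in bins.items()}


# Toy data, invented purely for illustration.
toy = [
    ItemResult(0.9, False, True),
    ItemResult(0.8, False, True),
    ItemResult(0.7, True, True),
    ItemResult(0.2, True, True),
    ItemResult(0.1, False, False),
    ItemResult(0.3, True, False),
]
gains = gain_by_residue_bin(toy)
print(gains)
```

In this toy setup, a dose-response pattern would appear as a positive mean gain in the high-residue bin and a gain near zero or below in the low-residue bin. A real analysis would likely use more than two bins and report uncertainty per bin.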
Entities
Institutions
- arXiv