ARTFEED — Contemporary Art Intelligence

Translation Tax Not a Scalar: Counterfactual Audit of Chinese Benchmarks

publication · 2026-05-11

A new study challenges the assumption that the Translation Tax in multilingual benchmarks is a single scalar quantity. Researchers audited English-to-Chinese translations using three proxy estimators: back-translation gaps, cue-score calibration, and a six-model native-control comparison. Back-translation gaps were small and parser-fragile, cue-score calibration did not predict item-level gains, and the native-control comparison showed model-family effects rather than a uniform benchmark effect. A same-item LLM-naturalization stress test then revealed a residue dose-response: high-residue items benefited from naturalization while low-residue items did not. The study concludes that the Translation Tax is not a single effect but a set of estimator- and item-dependent validity risks. The authors released per-cell evidence, the naturalization protocol, and human quality control data.

Key facts

  • The Translation Tax is often treated as a scalar in translated benchmarks.
  • Three proxy estimators were used: back-translation gaps, cue-score calibration, and six-model native-control comparison.
  • Back-translation gaps were small and parser-fragile.
  • Cue-score calibration did not predict item-level gains.
  • The six-model native-control comparison showed model-family effects rather than a uniform benchmark effect.
  • A same-item LLM-naturalization stress test held answer, options, and content fixed while rewriting Chinese surface form.
  • After correcting a prompt-construction bug, the contrast no longer supported a model-family interaction.
  • The stress test showed a residue dose-response: high-residue items benefited from naturalization while low-residue items did not.
  • The study concludes that the Translation Tax is not a single effect but a set of estimator- and item-dependent validity risks.
  • The study released per-cell evidence, the naturalization protocol, and human QC data.
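
As a rough illustration of what a residue dose-response check could look like, the sketch below ranks items by a residue score, splits them into equal-sized bins, and reports the mean accuracy gain per bin. It is not the authors' released protocol. The function name `dose_response`, the item keys, and the definition of gain (correct on the naturalized rewrite minus correct on the original translated item) are all assumptions made for illustration, and the sample data is invented.

```python
from collections import defaultdict
from statistics import mean


def dose_response(items, n_bins=3):
    """Mean accuracy gain per residue bin, ordered from low to high residue.

    Each item is a dict with hypothetical keys:
      'residue'             float, higher means more translation residue
      'correct_translated'  bool, model was right on the translated item
      'correct_naturalized' bool, model was right on the naturalized rewrite
    """
    if len(items) < n_bins:
        raise ValueError("need at least one item per bin")

    # Rank items by residue so bins hold roughly equal numbers of items.
    ranked = sorted(items, key=lambda it: it["residue"])
    gains = defaultdict(list)
    for rank, it in enumerate(ranked):
        bin_index = min(rank * n_bins // len(ranked), n_bins - 1)
        # Gain is +1, 0, or -1 for a single item and a single model.
        gain = int(it["correct_naturalized"]) - int(it["correct_translated"])
        gains[bin_index].append(gain)

    return [mean(gains[b]) for b in range(n_bins)]


# Invented toy data, for illustration only.
toy_items = [
    {"residue": 0.1, "correct_translated": True, "correct_naturalized": True},
    {"residue": 0.2, "correct_translated": False, "correct_naturalized": False},
    {"residue": 0.5, "correct_translated": True, "correct_naturalized": True},
    {"residue": 0.6, "correct_translated": False, "correct_naturalized": True},
    {"residue": 0.8, "correct_translated": False, "correct_naturalized": True},
    {"residue": 0.9, "correct_translated": False, "correct_naturalized": True},
]
per_bin_gain = dose_response(toy_items)
```

A gain that rises from the low-residue bin to the high-residue bin is the pattern a dose-response reading would expect; a flat profile would not support it. Binning by rank rather than by fixed score thresholds keeps the bins balanced when residue scores are skewed.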

Entities

Institutions

  • arXiv
