Cross-Lingual Transfer in Language Models Studied via In-Vitro Framework
A recent study posted on arXiv (2605.26683) examines cross-lingual transfer in language models in vitro, using two procedurally generated languages that share the same ontology, grammar, and composition but differ in surface realization. The researchers conducted 700 controlled runs, independently varying lexical distance, minority-language proportion, tokenizer training regime, and vocabulary size. They find that transfer depends more on whether tokenization preserves reusable cross-lingual substructure than on tokenizer balance or lexical similarity. Smaller vocabularies improved masked transfer by letting words decompose into shared pieces, whereas larger vocabularies could impede it by turning those forms into language-specific tokens. The study focuses on a masked minority-language condition that is never observed during training.
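The vocabulary-size effect can be illustrated with a toy sketch. This is not the paper's setup: the two four-word "languages", the minimal byte-pair-encoding (BPE) trainer, and the Jaccard overlap metric are all assumptions made here for illustration. The sketch only shows how a larger merge budget can absorb shared pieces into language-specific tokens.

```python
from collections import Counter

# Hypothetical toy languages: shared stems, language-specific suffixes.
LANG_A = ["katora", "mirura", "selara", "dotira"]
LANG_B = ["katoli", "miruli", "selali", "dotili"]


def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def train_bpe(words, num_merges):
    """Learn up to `num_merges` merges; stop early if no pairs remain."""
    corpus = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols in corpus:
            pairs.update(zip(symbols, symbols[1:]))
        if not pairs:
            break
        # Sort first so ties are broken deterministically.
        best = max(sorted(pairs), key=lambda p: pairs[p])
        merges.append(best)
        corpus = [merge_pair(s, best) for s in corpus]
    return merges


def encode(word, merges):
    """Tokenize a word by replaying the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


def token_overlap(lang_a, lang_b, num_merges):
    """Jaccard overlap between the token sets used by each language."""
    merges = train_bpe(lang_a + lang_b, num_merges)
    toks_a = {t for w in lang_a for t in encode(w, merges)}
    toks_b = {t for w in lang_b for t in encode(w, merges)}
    return len(toks_a & toks_b) / len(toks_a | toks_b)


if __name__ == "__main__":
    # The merge budget stands in for vocabulary size.
    for budget in (0, 4, 8, 100):
        print(budget, round(token_overlap(LANG_A, LANG_B, budget), 2))
```

With no merges, both toy languages are tokenized into the same set of characters, so the overlap is complete. Once the merge budget is large enough for every word to become a single token, the two languages share no tokens at all.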
Key facts
- Study uses two procedurally generated languages with shared ontology and grammar.
- Languages differ only in surface realization.
- 700 controlled runs conducted.
- Variables: lexical distance, minority-language proportion, tokenizer training regime, vocabulary size.
- Transfer governed by whether tokenization preserves reusable cross-lingual substructure, more than by tokenizer balance or lexical similarity.
- Smaller vocabularies improve masked transfer.
- Larger vocabularies can turn forms into language-specific tokens.
- Evaluation targets a masked minority-language condition never observed during training.
Entities
Institutions
- arXiv