Tokenizer Fertility Varies 1.6x Across Foundation Models on Ukrainian Legal Text

other · 2026-05-26

A recent arXiv study (2605.14890) benchmarks seven foundation models from five providers on 273 validated court decisions from Ukraine's state registry (EDRSR). It measures tokenizer fertility and zero-shot performance on three tasks. Tokenizer fertility varies 1.6x across the models, a factor that is often overlooked in model selection even though it directly affects cost: Qwen 3 models consume 60% more tokens than Llama-family models on identical input. NVIDIA Nemotron Super 3 (120B) achieves the highest composite score (83.1), outperforming Mistral Large 3, which has 5.6x more parameters, at one-third the API cost, so model size is not a reliable predictor of performance in this setting. Few-shot prompting degrades performance by up to 26 percentage points, and stratified and prompt-sensitivity tests confirm that the degradation is intrinsic to Ukrainian-language demonstrations.
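To make the fertility metric concrete, here is a minimal sketch that assumes fertility means the average number of tokens per whitespace-delimited word, a common convention; this summary does not state the study's exact definition. The `fertility` helper and the two toy tokenizers (`char_pairs`, `char_quads`) are hypothetical stand-ins for illustration, not the tokenizers of any model named above.

```python
from typing import Callable, Iterable, List


def fertility(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-delimited word over a corpus."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_tokens += len(tokenize(text))
        total_words += len(text.split())
    if total_words == 0:
        raise ValueError("corpus contains no words")
    return total_tokens / total_words


# Toy tokenizers (hypothetical): they cut each word into fixed-size pieces
# to mimic a vocabulary that covers Cyrillic text poorly or well.
def char_pairs(text: str) -> List[str]:
    """Split every word into pieces of at most 2 characters."""
    return [w[i:i + 2] for w in text.split() for i in range(0, len(w), 2)]


def char_quads(text: str) -> List[str]:
    """Split every word into pieces of at most 4 characters."""
    return [w[i:i + 4] for w in text.split() for i in range(0, len(w), 4)]


# Illustrative sentences, not taken from the EDRSR corpus.
corpus = ["суд постановив рішення", "апеляційна скарга відхилена"]

fine_grained = fertility(corpus, char_pairs)
coarse_grained = fertility(corpus, char_quads)
print(f"fertility ratio: {fine_grained / coarse_grained:.2f}")
```

In practice the `tokenize` argument would wrap each model's real tokenizer, and the ratio between two fertility values is the kind of figure the 1.6x headline number describes.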

Key facts

Tokenizer fertility varies 1.6x across foundation models on Ukrainian legal text.
Seven models from five providers benchmarked on 273 validated court decisions from EDRSR.
Qwen 3 models consume 60% more tokens than Llama-family models on identical input.
NVIDIA Nemotron Super 3 (120B) achieves highest composite score (83.1).
Nemotron outperforms Mistral Large 3 at one-third the API cost.
Few-shot prompting degrades performance by up to 26 percentage points.
The few-shot degradation is intrinsic to Ukrainian-language demonstrations.
Study is from arXiv:2605.14890.
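
The cost effect of the token gap can be worked through with simple arithmetic. The sketch below assumes input is billed per token at a flat rate; the word count, fertility values, and price are hypothetical placeholders rather than figures from the study, and only the 1.6x ratio between the two fertility values mirrors the reported gap.

```python
def input_cost(words: int, fertility: float, usd_per_million_tokens: float) -> float:
    """Estimated input cost in USD for a corpus of `words` words."""
    tokens = words * fertility
    return tokens * usd_per_million_tokens / 1_000_000


def break_even_price(baseline_price: float, fertility_ratio: float) -> float:
    """Per-token price at which the higher-fertility model costs the same."""
    return baseline_price / fertility_ratio


# Hypothetical workload and pricing, chosen only to illustrate the ratio.
WORDS = 2_000_000
PRICE = 1.00  # USD per million input tokens, assumed equal for both models

baseline = input_cost(WORDS, fertility=2.0, usd_per_million_tokens=PRICE)
heavier = input_cost(WORDS, fertility=3.2, usd_per_million_tokens=PRICE)

ratio = heavier / baseline
print(f"cost ratio at equal price: {ratio:.2f}")
print(f"break-even price: {break_even_price(PRICE, ratio):.3f} USD per million tokens")
```

Under these assumptions, a model that needs 60% more tokens must be priced at roughly 62.5% of the baseline per-token rate just to match its input cost, which is why list prices alone can mislead.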

Entities

Institutions

arXiv
Qwen
Llama
NVIDIA
Mistral
EDRSR

Locations

Ukraine

Sources

arXiv cs.AI — 2026-05-26