ARTFEED — Contemporary Art Intelligence

Soro: Tajik-Specialized LLM Outperforms Gemma 3 on Local Benchmarks

ai-technology · 2026-05-28

Researchers have introduced Soro, a family of conversational large language models (LLMs) specialized for Tajik, a language with little coverage in existing AI systems. The models start from open-weight Gemma 3 checkpoints and undergo Tajik-only continual pretraining on a curated 1.9-billion-token corpus of filtered web text, PDF documents, and curriculum-aligned educational materials. Supervised instruction tuning follows, using 40,000 Tajik teacher-style examples. To evaluate the models, the researchers built a suite of Tajik benchmarks covering general knowledge, linguistic competence, and academic entrance exams, and released it on Hugging Face. Soro significantly outperforms same-size Gemma 3 models on these benchmarks while retaining strong English performance on standard datasets. The models are designed for deployment in Tajikistan under tight compute and connectivity constraints, with FP8 and INT4 quantization used for efficiency.
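To make the deployment point concrete, the arithmetic behind weight quantization can be sketched as follows. This is a rough illustration, not the Soro team's tooling: the function name `weight_memory_gib`, the 16-bit baseline, and the parameter count are assumptions chosen for the example, and the estimate ignores activations, the KV cache, and the per-group scales that real INT4 formats store.

```python
# Approximate bits stored per weight in each format (assumed baseline: 16-bit).
# Real INT4 schemes also keep per-group scales, which this sketch ignores.
BITS_PER_WEIGHT = {"bf16": 16, "fp8": 8, "int4": 4}


def weight_memory_gib(num_params: int, fmt: str) -> float:
    """Return the approximate weight storage in GiB for a given format."""
    bits = BITS_PER_WEIGHT[fmt]
    # bits -> bytes -> GiB
    return num_params * bits / 8 / 2**30


if __name__ == "__main__":
    # Hypothetical parameter count, chosen only for illustration.
    params = 4_000_000_000
    for fmt in BITS_PER_WEIGHT:
        print(f"{fmt}: {weight_memory_gib(params, fmt):.2f} GiB")
```

Under these assumptions, FP8 halves the weight footprint relative to a 16-bit checkpoint and INT4 quarters it, which matters both for fitting a model on modest hardware and for downloading it over a slow connection.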

Key facts

  • Soro is a family of Tajik-specialized conversational LLMs.
  • Built from open-weight Gemma 3 checkpoints.
  • Tajik-only continual pretraining on a 1.9-billion-token corpus.
  • Corpus includes filtered web text, PDF documents, and educational materials.
  • Supervised instruction tuning on 40,000 Tajik teacher-style examples.
  • New Tajik benchmarks introduced for evaluation.
  • Benchmarks cover general knowledge, linguistic competence, and exam domains.
  • Soro outperforms same-size Gemma 3 on Tajik benchmarks.
  • Retains strong English performance on standard datasets.
  • Designed for deployment under tight compute and connectivity constraints in Tajikistan.
  • Uses FP8 and INT4 quantization for efficiency.
  • Benchmarks open-sourced on Hugging Face.
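The INT4 quantization mentioned in the key facts can be illustrated with a minimal round-trip sketch. The source does not say which quantization scheme Soro uses, so this assumes the simplest common variant: symmetric quantization with one shared scale per group of weights. The function names `quantize_int4` and `dequantize_int4` and the toy values are hypothetical.

```python
from typing import List, Tuple


def quantize_int4(weights: List[float]) -> Tuple[List[int], float]:
    """Map floats to integer codes in [-7, 7] using one shared scale."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    # Avoid division by zero for an all-zero (or empty) group.
    scale = max_abs / 7 if max_abs > 0 else 1.0
    # Round to the nearest code and clamp to the symmetric 4-bit range.
    codes = [max(-7, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_int4(codes: List[int], scale: float) -> List[float]:
    """Recover approximate floats from the 4-bit codes."""
    return [c * scale for c in codes]


if __name__ == "__main__":
    # Toy weight values, not taken from any real model.
    group = [0.7, -0.35, 0.1, 0.0]
    codes, scale = quantize_int4(group)
    restored = dequantize_int4(codes, scale)
    print(codes, scale, restored)
```

Each restored value differs from the original by at most about half the scale, which is the precision traded away for the smaller footprint. Production schemes typically use small groups so that a single outlier does not inflate the scale for many weights.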

Entities

Institutions

  • Hugging Face
  • Google (developer of Gemma 3)

Locations

  • Tajikistan
