Triadic Suffix Tokenization Scheme Aims to Improve LLM Numerical Reasoning

ai-technology · 2026-04-20

A new tokenization technique, Triadic Suffix Tokenization (TST), has been proposed to address the numerical reasoning weaknesses of large language models. Standard subword tokenizers split numbers inconsistently, so positional and decimal structure is lost, which contributes significantly to errors in arithmetic and scientific computation. TST groups digits into three-digit triads and labels each triad with an explicit magnitude marker. This gives a fixed, one-to-one mapping between suffixes and orders of magnitude for the integer part (thousands, millions, billions), with a parallel system for the fractional part (tenths, thousandths, millionths). Unlike approaches that leave the model to infer magnitude from position, TST is said to provide a consistent gradient signal that supports more stable training. Two implementation variants are proposed, one of which is a vocabulary-based method that adds up to 10,000 fixed tokens to an existing vocabulary. The paper is listed on arXiv as 2604.11582v2, under the replace-cross announcement type.

Key facts

Triadic Suffix Tokenization (TST) is a new tokenization method for numerical reasoning in LLMs
Standard subword tokenization fragments numbers inconsistently, causing loss of positional and decimal structure
TST partitions digits into three-digit triads with explicit magnitude markers
The scheme creates a fixed mapping between suffixes and orders of magnitude for the integer part
A parallel system handles fractional depth with replicated markers
TST provides a consistent gradient signal for stable convergence
Two implementation variants are proposed, including a vocabulary-based approach
Research announced on arXiv with identifier 2604.11582v2 under replace-cross type
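
The triad-and-marker idea summarized above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the suffix strings (K, M, B, T), the fractional marker "f", the reading of "replicated markers" as one "f" per triad of depth, and the function name tst_tokenize are all assumptions made for illustration, since the summary does not give the actual token format.

```python
import re

# Hypothetical marker names; the article does not specify the real suffix strings.
INT_SUFFIXES = ["", "K", "M", "B", "T"]  # units, thousands, millions, billions, trillions
FRAC_MARKER = "f"  # assumed: repeated once per triad of fractional depth


def tst_tokenize(number: str) -> list[str]:
    """Split a plain decimal string such as '1234567.0625' into triad tokens."""
    match = re.fullmatch(r"([0-9]+)(?:\.([0-9]+))?", number)
    if match is None:
        raise ValueError(f"not a plain decimal string: {number!r}")
    int_part = match.group(1).lstrip("0") or "0"
    frac_part = match.group(2) or ""

    # Integer part: group from the right so each triad lines up with a power of 1000.
    triads = [int_part[max(0, end - 3):end] for end in range(len(int_part), 0, -3)]
    triads.reverse()
    if len(triads) > len(INT_SUFFIXES):
        raise ValueError("integer part exceeds the suffix table in this sketch")

    tokens = []
    for i, triad in enumerate(triads):
        magnitude = len(triads) - 1 - i  # 0 = units, 1 = thousands, ...
        tokens.append(triad + INT_SUFFIXES[magnitude])

    # Fractional part: group from the left; the triad at depth d carries d markers.
    # A short final triad is left unpadded in this sketch.
    for depth, start in enumerate(range(0, len(frac_part), 3), start=1):
        tokens.append(frac_part[start:start + 3] + FRAC_MARKER * depth)
    return tokens


print(tst_tokenize("1234567.0625"))  # ['1M', '234K', '567', '062f', '5ff']
```

Under these assumptions, every triad carries its own magnitude label, so a token such as 234K means the same thing wherever it appears in the sequence, which is the property the summary contrasts with inferring magnitude from position.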
