ARTFEED — Contemporary Art Intelligence

AI Language Models Struggle with Informal Text: Tokenization Failures and Distribution Shifts

ai-technology · 2026-04-22

A research investigation examines how informal language affects natural language inference (NLI) accuracy in two transformer models: RoBERTa-large (355M parameters) and ELECTRA-small (14M parameters). The researchers perturbed the SNLI and MultiNLI datasets with slang substitutions, emoji replacements, Gen-Z filler tokens, and combinations of these. Slang substitution caused only a slight accuracy drop (at most 1.1 percentage points), thanks to WordPiece subword coverage, but emoji replacement proved far more damaging: ELECTRA's tokenizer mapped many altered content words to [UNK], affecting 93.6% of emoji-perturbed examples with an average of 2.91 [UNK] tokens each. Filler tokens such as 'no cap' exposed a different failure: they are in-vocabulary but absent from the training data, so the models assigned them inferential weight they do not carry. The study identifies tokenization failures and distribution shift as the primary failure modes. The paper is available on arXiv under identifier 2604.16787v1.
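The [UNK] failure described above follows directly from how greedy WordPiece tokenization works. Below is a minimal, self-contained sketch (not the paper's code, and using a toy vocabulary) showing why an emoji that shares no characters with any vocabulary entry collapses to a single [UNK] token:

```python
# Toy WordPiece-style tokenizer: greedy longest-match-first over a tiny
# illustrative vocabulary. Real tokenizers (e.g. ELECTRA's) work the same
# way in principle but with ~30k subword entries.
VOCAB = {"the", "cat", "sat", "on", "mat", "##s", "[UNK]"}

def wordpiece_tokenize(word, vocab):
    """Split one word into subword pieces; emit [UNK] if no prefix matches."""
    tokens, start = [], 0
    while start < len(word):
        end, match = len(word), None
        # Try the longest remaining substring first, shrinking until a
        # vocabulary entry is found.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation-piece marker
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # No piece of the word is in-vocabulary: the whole word is lost.
            return ["[UNK]"]
        tokens.append(match)
        start = end
    return tokens

print(wordpiece_tokenize("cats", VOCAB))  # ['cat', '##s']
print(wordpiece_tokenize("\U0001F602", VOCAB))  # ['[UNK]'] — emoji has no subword match
```

This illustrates the asymmetry the study reports: slang words usually decompose into known subwords and survive tokenization, while an out-of-vocabulary emoji character cannot be split at all and degrades to [UNK], erasing the content word it replaced.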

Key facts

  • Study examines informal language impact on NLI accuracy in ELECTRA-small and RoBERTa-large models
  • Four transformations applied: slang substitution, emoji replacement, Gen-Z filler tokens, and combinations
  • Slang substitution causes minimal degradation (≤1.1pp) due to WordPiece coverage
  • Emoji replacement causes tokenization failures with 93.6% of examples containing [UNK] tokens
  • Average of 2.91 [UNK] tokens per emoji example
  • Noise tokens ('no cap,' 'deadass,' 'tbh') are in-vocabulary but absent from training data
  • Models assign noise tokens inferential weight they don't actually carry
  • Research identifies tokenization failures and distribution shifts as primary failure modes
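The perturbations listed above can be sketched as simple text transformations. The snippet below is illustrative only: the substitution table and filler list are assumptions in the spirit of the study, not the paper's actual mappings.

```python
import random

# Hypothetical slang substitution table and filler inventory; the paper's
# real perturbation sets are not reproduced here.
SLANG_MAP = {"very": "hella", "friend": "bestie", "good": "fire"}
FILLERS = ["no cap", "deadass", "tbh"]

def slang_substitute(text):
    """Replace standard words with slang equivalents where a mapping exists."""
    return " ".join(SLANG_MAP.get(w, w) for w in text.split())

def insert_filler(text, rng):
    """Prepend a semantically empty filler that a robust model should ignore."""
    return f"{rng.choice(FILLERS)} {text}"

rng = random.Random(0)
premise = "my friend is very good at chess"
print(slang_substitute(premise))   # my bestie is hella fire at chess
print(insert_filler(premise, rng))
```

Because the fillers carry no propositional content, an NLI model's label for the perturbed premise should match its label for the original; the study's finding is that models instead shift their predictions, evidence of distribution-shift sensitivity rather than tokenizer failure.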

Entities

Institutions

  • arXiv

Sources