Working Memory Constraints Improve Transformer Learning Under Data Scarcity
A recent study posted to arXiv explores how human-like working memory limitations can be built into Transformer models. The researchers added attention mechanisms inspired by cognitive science, such as fixed-width windows and temporal decay, to modified GPT-2 architectures trained from scratch on datasets of 10 million and 100 million words. Performance was evaluated on grammatical judgment tasks (BLiMP) and compared against human reading-time data. The findings show that these cognitively inspired constraints, especially fixed-width attention, improve grammatical accuracy when training data is limited. The constrained models also aligned better with human processing metrics, suggesting that such constraints act as a useful inductive bias for linguistic representation in data-scarce settings.
Key facts
- Study integrates human-like working memory constraints into Transformer architecture
- Implements fixed-width window and temporal decay attention variants
- Modified GPT-2 models trained from scratch on 10M and 100M word datasets
- Evaluated on BLiMP grammatical judgment tasks and human reading time alignment
- Fixed-width attention significantly improves grammatical accuracy under data scarcity
- Constrained models show stronger alignment with human processing metrics
- Constraints serve as beneficial inductive bias for robust linguistic representations
- Findings are relevant for data-limited settings
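The two attention variants named above can be sketched concretely. The following is a minimal NumPy illustration, not the paper's implementation: it assumes a single attention head, models fixed-width attention as a causal mask that also drops tokens beyond the last `window` positions, and models temporal decay as a distance-proportional penalty on the attention logits (so attention to older tokens decays exponentially after the softmax). The function name `constrained_attention` and all parameters are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def constrained_attention(q, k, v, window=None, decay=None):
    """Causal self-attention with optional working-memory-style constraints.

    window: if set, each position attends only to the most recent `window`
            tokens (fixed-width attention).
    decay:  if set, logits are penalized by decay * distance, so the
            post-softmax weight on older tokens falls off exponentially
            with distance (temporal decay).
    """
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)              # (T, T) attention logits
    i = np.arange(T)[:, None]                  # query positions
    j = np.arange(T)[None, :]                  # key positions
    dist = i - j                               # how far back each key is
    mask = dist < 0                            # causal: never attend to the future
    if window is not None:
        mask |= dist >= window                 # fixed-width: drop tokens beyond window
    scores = np.where(mask, -np.inf, scores)
    if decay is not None:
        scores = scores - decay * np.maximum(dist, 0)
    return softmax(scores, axis=-1) @ v

# Toy usage: 6 tokens, 4-dim head, window of 3 and mild decay.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((6, 4)) for _ in range(3))
out = constrained_attention(q, k, v, window=3, decay=0.5)
print(out.shape)  # (6, 4)
```

Because the mask is causal, the first token can only attend to itself, so the first output row equals `v[0]` regardless of the window or decay settings; later rows blend at most the last three value vectors, with the decay term biasing each blend toward the most recent tokens.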