Working Memory Constraints Improve Transformer Learning Under Data Scarcity
A recent study posted to arXiv explores how human-like working memory limitations can be built into Transformer models. The researchers added attention mechanisms inspired by cognitive science, such as fixed-width windows and temporal decay, to modified GPT-2 architectures trained from scratch on datasets of 10 million and 100 million words. Performance was evaluated on grammatical judgment tasks (BLiMP) and compared against human reading-time data. The findings show that these cognitively inspired constraints, especially fixed-width attention, improve grammatical accuracy when training data is limited. The constrained models also aligned better with human processing metrics, suggesting that such constraints act as a useful inductive bias for linguistic representation in data-scarce settings.
Key facts
- Study integrates human-like working memory constraints into Transformer architecture
- Implements fixed-width window and temporal decay attention variants
- Modified GPT-2 models trained from scratch on 10M and 100M word datasets
- Evaluated on BLiMP grammatical judgment tasks and human reading time alignment
- Fixed-width attention significantly improves grammatical accuracy under data scarcity
- Constrained models show stronger alignment with human processing metrics
- Constraints serve as beneficial inductive bias for robust linguistic representations
- Findings are relevant for data-limited settings
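The two attention variants named above can be sketched concretely. The following is a minimal NumPy illustration, not the paper's implementation: it assumes a single attention head, models fixed-width attention as a causal mask that also drops tokens beyond the last `window` positions, and models temporal decay as a distance-proportional penalty on the attention logits (so attention to older tokens decays exponentially after the softmax). The function name `constrained_attention` and all parameters are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def constrained_attention(q, k, v, window=None, decay=None):
    """Causal self-attention with optional working-memory-style constraints.

    window: if set, each position attends only to the most recent `window`
            tokens (fixed-width attention).
    decay:  if set, logits are penalized by decay * distance, so the
            post-softmax weight on older tokens falls off exponentially
            with distance (temporal decay).
    """
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)              # (T, T) attention logits
    i = np.arange(T)[:, None]                  # query positions
    j = np.arange(T)[None, :]                  # key positions
    dist = i - j                               # how far back each key is
    mask = dist < 0                            # causal: never attend to the future
    if window is not None:
        mask |= dist >= window                 # fixed-width: drop tokens beyond window
    scores = np.where(mask, -np.inf, scores)
    if decay is not None:
        scores = scores - decay * np.maximum(dist, 0)
    return softmax(scores, axis=-1) @ v

# Toy usage: 6 tokens, 4-dim head, window of 3 and mild decay.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((6, 4)) for _ in range(3))
out = constrained_attention(q, k, v, window=3, decay=0.5)
print(out.shape)  # (6, 4)
```

Because the mask is causal, the first token can only attend to itself, so the first output row equals `v[0]` regardless of the window or decay settings; later rows blend at most the last three value vectors, with the decay term biasing each blend toward the most recent tokens.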