ARTFEED — Contemporary Art Intelligence

Unsupervised Neural Networks Spontaneously Concatenate Speech

ai-technology · 2026-04-25

A recent study posted to arXiv suggests that concatenation, a basic suboperation of syntax, can be modeled directly from raw speech using unsupervised deep neural networks. The researchers trained ciwGAN/fiwGAN models, GAN architectures built on convolutional networks, on acoustic recordings of single words. Remarkably, despite never having access to multi-word data, the models began producing outputs containing two or even three concatenated words, a phenomenon the authors call "spontaneous concatenation." The result replicated across multiple independently trained models with different hyperparameters and datasets. Networks trained on two-word inputs additionally produced novel, unseen combinations, an early sign of compositionality. The findings challenge the traditional text-centric focus of computational models of syntax.
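To make the setup concrete, here is a minimal sketch of the latent-code convention used by ciwGAN-style models: the generator's input concatenates a one-hot "word class" code with uniform noise, and the network maps that latent vector to a fixed-length waveform. The dimensions below (5 classes, 95 noise dimensions, 16384-sample output) and the single linear layer standing in for the convolutional generator are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, loosely following the WaveGAN/ciwGAN convention:
NUM_CLASSES = 5      # assumed number of lexical classes in the latent code
NOISE_DIM = 95       # assumed remaining latent dimensions (uniform noise)
AUDIO_LEN = 16384    # WaveGAN's standard ~1-second output at 16 kHz

def make_latent(word_class: int) -> np.ndarray:
    """Build a ciwGAN-style latent: one-hot class code + uniform noise."""
    one_hot = np.zeros(NUM_CLASSES)
    one_hot[word_class] = 1.0
    noise = rng.uniform(-1.0, 1.0, NOISE_DIM)
    return np.concatenate([one_hot, noise])

# Toy stand-in for the convolutional generator: one random linear map
# followed by tanh, just to show the latent -> waveform shape contract.
W = rng.normal(0.0, 0.01, (NUM_CLASSES + NOISE_DIM, AUDIO_LEN))

def generate(z: np.ndarray) -> np.ndarray:
    """Map a latent vector to a waveform in [-1, 1]."""
    return np.tanh(z @ W)

wave = generate(make_latent(word_class=2))
```

In the actual models, a separate Q-network is trained to recover the class code from the generated audio, which pressures the generator to make the code informative; the paper's observation is that outputs sampled from such networks can contain more words than any single training item.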

Key facts

  • arXiv:2305.01626v4
  • Focus on concatenation as a basic suboperation of syntax
  • Spontaneous concatenation observed in ciwGAN/fiwGAN models
  • Models trained on single words only
  • Outputs with two or three concatenated words emerged
  • Replicated in multiple models with different hyperparameters
  • Novel word combinations formed from two-word training
  • Precursors to compositionality detected in outputs

Entities

Institutions

  • arXiv

Sources