Gelina: Unified Speech and Gesture Synthesis via Interleaved Token Prediction
Researchers have introduced Gelina, a unified framework that jointly generates speech and co-speech gestures from text by predicting interleaved token sequences in a discrete autoregressive backbone with modality-specific decoders. Unlike cascaded approaches that synthesize speech and gestures sequentially, Gelina keeps the two modalities aligned in both synchrony and prosody. The system supports multi-speaker and multi-style cloning, and also allows gesture-only synthesis from speech input. Evaluations indicate competitive speech quality and improved gesture generation compared to unimodal baselines. The work was published on arXiv (2510.12834v4) and marks a step toward more natural human-computer interaction.
Key facts
- Gelina jointly synthesizes speech and co-speech gestures from text.
- Uses interleaved token sequences in a discrete autoregressive backbone.
- Includes modality-specific decoders.
- Supports multi-speaker and multi-style cloning.
- Enables gesture-only synthesis from speech inputs.
- Demonstrates competitive speech quality and improved gesture generation.
- Published on arXiv with identifier 2510.12834v4.
- Addresses synchrony and prosody alignment in multimodal communication.
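To make the interleaving idea concrete, here is a minimal sketch of how two discrete token streams can be merged into a single sequence for autoregressive prediction and then routed back to modality-specific decoders. The 2:1 interleaving ratio, the token values, and the modality tags are illustrative assumptions, not details from the paper.

```python
# Illustrative sketch of interleaved token prediction.
# The interleaving ratio and token vocabularies are assumptions,
# not Gelina's actual configuration.

def interleave(speech_tokens, gesture_tokens, ratio=2):
    """Merge two discrete token streams into one sequence,
    taking `ratio` speech tokens per gesture token."""
    out, s, g = [], 0, 0
    while s < len(speech_tokens) or g < len(gesture_tokens):
        for _ in range(ratio):
            if s < len(speech_tokens):
                out.append(("S", speech_tokens[s]))  # speech token
                s += 1
        if g < len(gesture_tokens):
            out.append(("G", gesture_tokens[g]))  # gesture token
            g += 1
    return out

def split(interleaved):
    """Route tokens back to their modality-specific decoders."""
    speech = [t for m, t in interleaved if m == "S"]
    gesture = [t for m, t in interleaved if m == "G"]
    return speech, gesture

seq = interleave([10, 11, 12, 13], [7, 8])
print(seq)    # [('S', 10), ('S', 11), ('G', 7), ('S', 12), ('S', 13), ('G', 8)]
print(split(seq))  # ([10, 11, 12, 13], [7, 8])
```

Because both modalities live in one sequence, a single autoregressive model predicts them jointly, which is what lets synchrony and prosody stay aligned rather than being reconciled after the fact.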