Lightweight Transformer Predicts Iconic Gestures from Text and Emotion
Researchers have introduced a lightweight transformer model that predicts emotion-sensitive iconic gestures to accompany robot speech. The model determines both the placement and the intensity of gestures from text and emotional context alone, so no audio input is required at inference time. On the BEAT2 dataset it outperforms GPT-4o in both classifying where semantic gestures should be placed and regressing their intensity. It also maintains a compact computational footprint, making it well suited to real-time use in embodied agents.
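The summary does not specify the model's architecture or hyperparameters, but the described behavior, a small transformer that maps text tokens plus an emotion label to per-word gesture placement (classification) and gesture intensity (regression), can be illustrated with a minimal sketch. Everything below (layer sizes, vocabulary size, emotion set, the two prediction heads, and all names) is an assumption for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class GesturePlacementSketch(nn.Module):
    """Hypothetical sketch: a small transformer encoder over text tokens,
    conditioned on an emotion label, with two heads:
    (1) per-token classification of whether an iconic gesture is placed
        on that word, and
    (2) per-token regression of gesture intensity.
    Architecture and sizes are assumptions, not the paper's model."""

    def __init__(self, vocab_size=10000, num_emotions=8,
                 d_model=128, nhead=4, num_layers=2, max_len=64):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        self.emotion_emb = nn.Embedding(num_emotions, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead,
            dim_feedforward=4 * d_model, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers)
        self.placement_head = nn.Linear(d_model, 2)   # gesture / no gesture per word
        self.intensity_head = nn.Linear(d_model, 1)   # scalar intensity per word

    def forward(self, token_ids, emotion_ids):
        # token_ids: (batch, seq_len); emotion_ids: (batch,)
        batch, seq_len = token_ids.shape
        positions = torch.arange(seq_len, device=token_ids.device).unsqueeze(0)
        x = self.token_emb(token_ids) + self.pos_emb(positions)
        # Broadcast the emotion embedding over the sequence as a conditioning signal.
        x = x + self.emotion_emb(emotion_ids).unsqueeze(1)
        h = self.encoder(x)
        placement_logits = self.placement_head(h)        # (batch, seq_len, 2)
        intensity = self.intensity_head(h).squeeze(-1)   # (batch, seq_len)
        return placement_logits, intensity


# Example: predict gesture placement and intensity for a 12-word utterance.
model = GesturePlacementSketch()
tokens = torch.randint(0, 10000, (1, 12))   # placeholder token ids
emotion = torch.tensor([3])                 # placeholder emotion label
logits, intensity = model(tokens, emotion)
print(logits.shape, intensity.shape)        # (1, 12, 2) and (1, 12)
```

Because only text and an emotion label enter the forward pass, such a model needs no audio features at inference, which is consistent with the paper's stated design goal of a compact, real-time-capable predictor.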
Key facts
- Co-speech gestures increase engagement and improve speech understanding.
- Most data-driven robot gesture systems generate rhythmic, beat-like motion; few integrate semantic emphasis.
- The proposed model is a lightweight transformer.
- It derives iconic gesture placement and intensity from text and emotion.
- No audio input is required at inference time.
- The model outperforms GPT-4o on the BEAT2 dataset.
- It is computationally compact and suitable for real-time deployment.
- The research is categorized under Computer Science > Robotics.