Lightweight Transformer Predicts Iconic Gestures from Text and Emotion
Researchers have introduced a lightweight transformer model that predicts emotion-sensitive iconic gestures to accompany robot speech. The model determines both the placement and the intensity of gestures from text and emotional context alone, so no audio input is required at inference time. On the BEAT2 dataset it outperforms GPT-4o in both classifying where semantic gestures should be placed and regressing their intensity. It also maintains a compact computational footprint, making it well suited to real-time use in embodied agents.
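The summary does not specify the model's architecture or hyperparameters, but the described behavior, a small transformer that maps text tokens plus an emotion label to per-word gesture placement (classification) and gesture intensity (regression), can be illustrated with a minimal sketch. Everything below (layer sizes, vocabulary size, emotion set, the two prediction heads, and all names) is an assumption for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class GesturePlacementSketch(nn.Module):
    """Hypothetical sketch: a small transformer encoder over text tokens,
    conditioned on an emotion label, with two heads:
    (1) per-token classification of whether an iconic gesture is placed
        on that word, and
    (2) per-token regression of gesture intensity.
    Architecture and sizes are assumptions, not the paper's model."""

    def __init__(self, vocab_size=10000, num_emotions=8,
                 d_model=128, nhead=4, num_layers=2, max_len=64):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        self.emotion_emb = nn.Embedding(num_emotions, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead,
            dim_feedforward=4 * d_model, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers)
        self.placement_head = nn.Linear(d_model, 2)   # gesture / no gesture per word
        self.intensity_head = nn.Linear(d_model, 1)   # scalar intensity per word

    def forward(self, token_ids, emotion_ids):
        # token_ids: (batch, seq_len); emotion_ids: (batch,)
        batch, seq_len = token_ids.shape
        positions = torch.arange(seq_len, device=token_ids.device).unsqueeze(0)
        x = self.token_emb(token_ids) + self.pos_emb(positions)
        # Broadcast the emotion embedding over the sequence as a conditioning signal.
        x = x + self.emotion_emb(emotion_ids).unsqueeze(1)
        h = self.encoder(x)
        placement_logits = self.placement_head(h)        # (batch, seq_len, 2)
        intensity = self.intensity_head(h).squeeze(-1)   # (batch, seq_len)
        return placement_logits, intensity


# Example: predict gesture placement and intensity for a 12-word utterance.
model = GesturePlacementSketch()
tokens = torch.randint(0, 10000, (1, 12))   # placeholder token ids
emotion = torch.tensor([3])                 # placeholder emotion label
logits, intensity = model(tokens, emotion)
print(logits.shape, intensity.shape)        # (1, 12, 2) and (1, 12)
```

Because only text and an emotion label enter the forward pass, such a model needs no audio features at inference, which is consistent with the paper's stated design goal of a compact, real-time-capable predictor.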
Key facts
- Co-speech gestures increase engagement and improve speech understanding.
- Most data-driven robot gesture systems generate rhythmic, beat-like motion; few integrate semantic emphasis.
- The proposed model is a lightweight transformer.
- It derives iconic gesture placement and intensity from text and emotion.
- No audio input is required at inference time.
- The model outperforms GPT-4o on the BEAT2 dataset.
- It is computationally compact and suitable for real-time deployment.
- The research is categorized under Computer Science > Robotics.