ARTFEED — Contemporary Art Intelligence

Lightweight Transformer Predicts Iconic Gestures from Text and Emotion

ai-technology · 2026-04-25

A team of researchers has introduced a streamlined transformer model designed to predict emotion-sensitive iconic gestures for robot co-speech. The model determines both the placement and the intensity of gestures from text and emotional context alone, eliminating the need for audio input at inference time. It outperforms GPT-4o on the BEAT2 dataset, excelling in both the classification of semantic gesture placement and the regression of gesture intensity. It also maintains a compact computational footprint, making it well suited to real-time use in embodied agents.
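The digest does not describe the model's internals, so the following is a purely illustrative sketch of the input/output contract it implies: (token, emotion) embeddings pass through a single self-attention step, and each token gets a placement probability from a classification head and an intensity value from a regression head. The embedding scheme, dimensions, and all function names here are invented for illustration, not taken from the paper.

```python
import math
import random

def embed(token, emotion, dim=16):
    """Deterministic stand-in for learned text + emotion embeddings
    (hypothetical; the actual model's embeddings are not described)."""
    rng = random.Random(f"{token}|{emotion}")
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def self_attention(vectors):
    """Single-head self-attention with identity Q/K/V projections,
    standing in for the lightweight transformer encoder."""
    scale = math.sqrt(len(vectors[0]))
    attended = []
    for q in vectors:
        weights = softmax([dot(q, k) / scale for k in vectors])
        attended.append([sum(w * v[i] for w, v in zip(weights, vectors))
                        for i in range(len(q))])
    return attended

def predict_gestures(tokens, emotion):
    """Per-token (placement probability, intensity) predictions:
    a sigmoid classification head and a clipped regression head."""
    ctx = self_attention([embed(t, emotion) for t in tokens])
    results = []
    for tok, h in zip(tokens, ctx):
        placement = 1.0 / (1.0 + math.exp(-sum(h)))              # classification head
        intensity = max(0.0, min(1.0, 0.5 + sum(h) / len(h)))    # regression head
        results.append((tok, round(placement, 3), round(intensity, 3)))
    return results

# Example: no audio is consumed; only text tokens and an emotion label go in.
preds = predict_gestures("put it down gently".split(), "calm")
```

Note that, as in the paper's setup, audio never appears in the inference signature: the interesting design point is that placement and intensity are conditioned only on text tokens and an emotion label.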

Key facts

  • Co-speech gestures increase engagement and improve speech understanding.
  • Most data-driven robot systems generate rhythmic, beat-like motion; few integrate semantic emphasis.
  • The proposed model is a lightweight transformer.
  • It derives iconic gesture placement and intensity from text and emotion.
  • No audio input is required at inference time.
  • The model outperforms GPT-4o on the BEAT2 dataset.
  • It is computationally compact and suitable for real-time deployment.
  • The research is categorized under Computer Science > Robotics.

Entities

Institutions

  • arXiv

Sources