ARTFEED — Contemporary Art Intelligence

VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing

ai-technology · 2026-05-11

VITA-QinYu is introduced as the first expressive end-to-end spoken language model capable of role-playing and singing generation. It adopts a hybrid speech-text paradigm with multi-codebook audio tokens for a richer paralinguistic representation, and a data-generation pipeline synthesized 15.8K hours of training data. On objective role-playing benchmarks, the model outperforms peer spoken language models by 7 percentage points.
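The multi-codebook idea can be illustrated with a minimal sketch: each audio frame is represented by several parallel codebook indices rather than one, and the parallel indices are flattened into a single token stream by giving each codebook its own ID range. This is an assumption-laden illustration only; the paper's actual tokenizer, codebook count, and codebook size are not specified here.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class MultiCodebookFrame:
    """One audio frame represented by several parallel codebook indices.

    Multiple codebooks let a single frame carry more paralinguistic
    detail (e.g. timbre, prosody) than one coarse token could.
    Hypothetical structure; not the paper's actual tokenizer.
    """
    codes: List[int]  # one index per codebook

def frames_to_tokens(frames: List[MultiCodebookFrame],
                     codebook_size: int = 1024) -> List[int]:
    """Flatten parallel codebook indices into one token stream.

    Each codebook is offset into its own ID range (codebook_size is an
    assumed value), so indices from different codebooks never collide
    in the shared vocabulary.
    """
    tokens = []
    for frame in frames:
        for cb, code in enumerate(frame.codes):
            tokens.append(cb * codebook_size + code)
    return tokens

# Two frames, two codebooks each: codebook 1 is shifted by 1024.
frames = [MultiCodebookFrame([5, 17]), MultiCodebookFrame([9, 3])]
print(frames_to_tokens(frames))  # → [5, 1041, 9, 1027]
```

The offset trick is one common way to host several codebooks in a single embedding table; whether VITA-QinYu does exactly this is not stated in the summary.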

Key facts

  • VITA-QinYu is the first expressive end-to-end spoken language model for role-playing and singing.
  • It adopts a hybrid speech-text paradigm with multi-codebook audio tokens.
  • A data generation pipeline synthesized 15.8K hours of training data.
  • The model outperforms peer SLMs by 7 percentage points on objective role-playing benchmarks.
  • Human speech expressiveness includes personality, mood, and performance elements.
  • The model extends interleaved text-audio modeling.
  • The design preserves clear separation between modalities to avoid interference.
  • The paper is published on arXiv with ID 2605.06765.
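The interleaved text-audio modeling and modality separation mentioned above can be sketched as follows, assuming (hypothetically) that text and audio tokens are kept in disjoint ID ranges; the actual vocabulary sizes, chunking scheme, and special tokens are assumptions, not details from the paper.

```python
from typing import Iterable, List

TEXT_VOCAB = 32000          # assumed text vocabulary size
AUDIO_OFFSET = TEXT_VOCAB   # audio IDs live above the text range

def interleave(text_chunks: Iterable[List[int]],
               audio_chunks: Iterable[List[int]]) -> List[int]:
    """Alternate text and audio chunks in one token sequence.

    Audio IDs are shifted above the text range, so the two modalities
    share one sequence while remaining cleanly separated — one simple
    way to avoid cross-modal interference.
    """
    seq: List[int] = []
    for text, audio in zip(text_chunks, audio_chunks):
        seq.extend(text)                             # raw text IDs
        seq.extend(AUDIO_OFFSET + a for a in audio)  # shifted audio IDs
    return seq

def is_audio(token: int) -> bool:
    """Modality of a token is recoverable from its ID range alone."""
    return token >= AUDIO_OFFSET

seq = interleave([[101, 102]], [[7, 8]])
print(seq)  # → [101, 102, 32007, 32008]
```

Keeping modalities in separate ID ranges means a single decoder can emit both, and downstream code can route each token without extra markers; again, this is an illustrative design, not the paper's confirmed mechanism.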

Entities

Institutions

  • arXiv
