VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing
VITA-QinYu is introduced as the first expressive end-to-end spoken language model capable of role-playing and singing generation. It adopts a hybrid speech-text paradigm with multi-codebook audio tokens for richer paralinguistic representation, and a data generation pipeline synthesized 15.8K hours of training data. The model outperforms peer SLMs by 7 percentage points on objective role-playing benchmarks.
Key facts
- VITA-QinYu is the first expressive end-to-end spoken language model for role-playing and singing.
- It adopts a hybrid speech-text paradigm with multi-codebook audio tokens.
- A data generation pipeline synthesized 15.8K hours of training data.
- The model outperforms peer SLMs by 7 percentage points on objective role-playing benchmarks.
- Human speech expressiveness encompasses personality, mood, and performative elements such as singing.
- The model extends interleaved text-audio modeling.
- The design preserves clear separation between modalities to avoid interference.
- The paper is published on arXiv with ID 2605.06765.
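The hybrid speech-text paradigm noted above can be illustrated as an interleaving scheme: runs of text tokens alternate with frames of audio tokens, where each frame carries one token per codebook. The sketch below is a minimal illustration of that idea; the chunk sizes, codebook count, and token values are assumptions for demonstration, not the paper's actual configuration.

```python
# Illustrative sketch (not the paper's scheme): interleave text tokens with
# multi-codebook audio frames into a single modeling stream, keeping the two
# modalities in clearly separated, tagged segments.

def interleave(text_tokens, audio_frames, text_chunk=2, audio_chunk=3):
    """Merge text tokens and audio frames into one interleaved sequence.

    text_tokens: list of text vocabulary ids.
    audio_frames: list of tuples, one token id per codebook per frame.
    Alternates text_chunk text tokens with audio_chunk audio frames,
    tagging each element with its modality.
    """
    out = []
    t, a = 0, 0
    while t < len(text_tokens) or a < len(audio_frames):
        for _ in range(text_chunk):
            if t < len(text_tokens):
                out.append(("text", text_tokens[t]))
                t += 1
        for _ in range(audio_chunk):
            if a < len(audio_frames):
                out.append(("audio", audio_frames[a]))
                a += 1
    return out

# Hypothetical example: 4 text tokens, 6 audio frames with 2 codebooks each.
text = [101, 102, 103, 104]
audio = [(7, 3), (8, 1), (9, 2), (5, 5), (6, 0), (4, 4)]
seq = interleave(text, audio)
```

Tagging each element with its modality keeps text and audio segments cleanly separated in the stream, mirroring the stated design goal of avoiding cross-modal interference.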
Entities
Institutions
- arXiv