VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing
VITA-QinYu is introduced as the first expressive end-to-end spoken language model capable of role-playing and singing generation. It adopts a hybrid speech-text paradigm with multi-codebook audio tokens for richer paralinguistic representation, and a data generation pipeline synthesized 15.8K hours of training data. The model outperforms peer SLMs by 7 percentage points on objective role-playing benchmarks.
Key facts
- VITA-QinYu is the first expressive end-to-end spoken language model for role-playing and singing.
- It adopts a hybrid speech-text paradigm with multi-codebook audio tokens.
- A data generation pipeline synthesized 15.8K hours of training data.
- The model outperforms peer SLMs by 7 percentage points on objective role-playing benchmarks.
- Human speech expressiveness encompasses personality, mood, and performative elements such as singing.
- The model extends interleaved text-audio modeling.
- The design preserves clear separation between modalities to avoid interference.
- The paper is published on arXiv with ID 2605.06765.
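The hybrid speech-text paradigm noted above can be illustrated as an interleaving scheme: runs of text tokens alternate with frames of audio tokens, where each frame carries one token per codebook. The sketch below is a minimal illustration of that idea; the chunk sizes, codebook count, and token values are assumptions for demonstration, not the paper's actual configuration.

```python
# Illustrative sketch (not the paper's scheme): interleave text tokens with
# multi-codebook audio frames into a single modeling stream, keeping the two
# modalities in clearly separated, tagged segments.

def interleave(text_tokens, audio_frames, text_chunk=2, audio_chunk=3):
    """Merge text tokens and audio frames into one interleaved sequence.

    text_tokens: list of text vocabulary ids.
    audio_frames: list of tuples, one token id per codebook per frame.
    Alternates text_chunk text tokens with audio_chunk audio frames,
    tagging each element with its modality.
    """
    out = []
    t, a = 0, 0
    while t < len(text_tokens) or a < len(audio_frames):
        for _ in range(text_chunk):
            if t < len(text_tokens):
                out.append(("text", text_tokens[t]))
                t += 1
        for _ in range(audio_chunk):
            if a < len(audio_frames):
                out.append(("audio", audio_frames[a]))
                a += 1
    return out

# Hypothetical example: 4 text tokens, 6 audio frames with 2 codebooks each.
text = [101, 102, 103, 104]
audio = [(7, 3), (8, 1), (9, 2), (5, 5), (6, 0), (4, 4)]
seq = interleave(text, audio)
```

Tagging each element with its modality keeps text and audio segments cleanly separated in the stream, mirroring the stated design goal of avoiding cross-modal interference.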
Entities
Institutions
- arXiv