Raon-Speech: A 9B-Parameter Speech Language Model for English and Korean
Raon-Speech is a 9-billion-parameter speech language model (SpeechLM) for English and Korean that understands, answers, and generates speech. It is built by converting a pre-trained LLM into a SpeechLM while preserving the LLM's strong text capabilities. Training used 1.38 million hours of curated speech and text data and ran in three stages: speech module alignment, end-to-end pre-training with knowledge distillation, and post-training with multi-task preference optimization. An extension, Raon-SpeechChat, supports natural real-time conversation in full-duplex mode. Across 42 English and Korean benchmarks, Raon-Speech outperformed eight recent audio foundation models, including Qwen2.5-Omni and Fun-Audio-Cha. The technical report is available on arXiv.
Key facts
- Raon-Speech is a 9B-parameter speech language model for English and Korean.
- It handles speech understanding, answering, and generation.
- Raon-SpeechChat is a full-duplex extension for real-time conversation.
- The model preserves strong text capabilities from a pre-trained LLM.
- Trained on 1.38M hours of curated speech and text datasets.
- Training stages: speech module alignment, end-to-end pre-training with knowledge distillation, and post-training with multi-task preference optimization.
- Evaluated on 42 English and Korean benchmarks.
- Outperformed eight recent audio foundation models, including Qwen2.5-Omni and Fun-Audio-Cha.
Entities
Institutions
- arXiv