ARTFEED — Contemporary Art Intelligence

Raon-Speech: A 9B-Parameter Speech Language Model for English and Korean

ai-technology · 2026-05-26

Raon-Speech is a 9-billion-parameter speech language model (SpeechLM) for English and Korean that handles understanding, answering, and generating speech. It is built by converting a pre-trained LLM into a SpeechLM while preserving the base model's strong text capabilities. Training used 1.38 million hours of curated speech and text data and proceeded in three stages: aligning the speech modules, end-to-end pre-training with knowledge distillation, and post-training with multi-task preference optimization. An extension, Raon-SpeechChat, supports natural real-time conversation in full-duplex mode. Across 42 English and Korean benchmarks, Raon-Speech outperformed eight other recent audio foundation models, including Qwen2.5-Omni and Fun-Audio-Cha. The technical report is available on arXiv.

Key facts

  • Raon-Speech is a 9B-parameter speech language model for English and Korean.
  • It handles speech understanding, answering, and generation.
  • Raon-SpeechChat is a full-duplex extension for real-time conversation.
  • The model preserves strong text capabilities from a pre-trained LLM.
  • Trained on 1.38M hours of curated speech and text datasets.
  • Training stages: speech module alignment, end-to-end pre-training with knowledge distillation, and multi-task preference optimization.
  • Evaluated on 42 English and Korean benchmarks.
  • Outperformed eight similar models including Qwen2.5-Omni and Fun-Audio-Cha.
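
The second training stage relies on knowledge distillation, which this summary names but does not spell out. As a rough illustration only, the sketch below shows the generic temperature-softened distillation loss, where a student's output distribution is pulled toward a teacher's. It is not taken from the Raon-Speech report: the function names, the temperature value, and the toy logits are all assumptions made for the example.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    Generic textbook formulation (hypothetical helper, not the
    objective used by the Raon-Speech authors).
    """
    p = softmax(teacher_logits, temperature)  # teacher targets
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q))
    return temperature ** 2 * kl


# Toy logits over a three-token vocabulary (made-up numbers).
teacher = [2.0, 0.5, -1.0]
print(distillation_loss(teacher, teacher))              # identical logits give 0.0
print(distillation_loss([0.0, 0.0, 0.0], teacher) > 0)  # a mismatched student is penalized
```

In a setup like the one described, the teacher would plausibly be the original text LLM, so that the speech-enabled student keeps its text behavior; that reading is an inference from the summary, not a confirmed detail.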

Entities

Institutions

  • arXiv
