ARTFEED — Contemporary Art Intelligence

FormalASR: End-to-End Spoken Chinese to Formal Text

ai-technology · 2026-05-20

FormalASR is a pair of compact end-to-end models (0.6B and 1.7B parameters) that transcribe spoken Chinese directly into formal written text, bypassing the traditional two-stage ASR+LLM pipeline. The models are adapted from Qwen3-ASR by supervised fine-tuning on two newly constructed datasets, WenetSpeech-Formal and Speechio-Formal. Both datasets were built by using LLMs to rewrite informal speech transcripts into formal text and then applying quality filtering. In the reported experiments, FormalASR achieves up to a 37.4% relative reduction in Character Error Rate (CER) compared with verbatim baselines. Because a single model replaces the two-stage pipeline, the approach reduces latency and memory costs, which makes it suitable for on-device deployment. The research is published on arXiv under the identifier 2605.19266.
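For readers unfamiliar with the metric, the following is a minimal sketch of how a character error rate and a relative reduction are commonly computed. It is not the authors' evaluation code: the function names are hypothetical, and the example numbers in the comments are made up for illustration rather than taken from the paper.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference character
                            curr[j - 1] + 1,      # insert a hypothesis character
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline_cer: float, new_cer: float) -> float:
    """Fraction of the baseline error that the new system removes."""
    if baseline_cer <= 0:
        raise ValueError("baseline CER must be positive")
    return (baseline_cer - new_cer) / baseline_cer


# Illustrative values only (not from the paper): a baseline CER of 0.20
# and a new CER of 0.15 correspond to a relative reduction of about 0.25.
example = relative_reduction(0.20, 0.15)
```

A relative reduction is therefore a ratio of error rates, not a difference in percentage points: a 37.4% relative reduction means the new CER is 62.6% of the baseline CER.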

Key facts

  • FormalASR is a pair of end-to-end models that convert spoken Chinese directly to formal text.
  • Two model sizes: 0.6B and 1.7B parameters.
  • Fine-tuned from Qwen3-ASR.
  • Two new datasets: WenetSpeech-Formal and Speechio-Formal.
  • Datasets built via LLM rewriting and quality filtering.
  • Up to 37.4% relative CER reduction over verbatim baselines.
  • Aims to reduce latency and memory for on-device deployment.
  • Published on arXiv:2605.19266.
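The dataset construction described above (LLM rewriting followed by quality filtering) can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' pipeline: the source does not say which LLM, prompt, or filtering criteria were used, so the `rewrite` callable, the `toy_rewrite` stand-in, and the length-ratio filter are all hypothetical.

```python
from typing import Callable, Iterable, List, Tuple


def length_ratio_ok(verbatim: str, formal: str,
                    low: float = 0.5, high: float = 1.5) -> bool:
    """Hypothetical quality filter: reject empty or wildly resized rewrites."""
    if not verbatim or not formal:
        return False
    ratio = len(formal) / len(verbatim)
    return low <= ratio <= high


def build_formal_pairs(
    transcripts: Iterable[str],
    rewrite: Callable[[str], str],
    keep: Callable[[str, str], bool] = length_ratio_ok,
) -> List[Tuple[str, str]]:
    """Rewrite each verbatim transcript and keep only pairs that pass the filter."""
    pairs = []
    for verbatim in transcripts:
        formal = rewrite(verbatim).strip()
        if keep(verbatim, formal):
            pairs.append((verbatim, formal))
    return pairs


def toy_rewrite(text: str) -> str:
    """Stand-in for an LLM call (assumption): only strips two filler characters."""
    return text.replace("嗯", "").replace("呃", "")


# The second transcript becomes empty after rewriting and is filtered out.
pairs = build_formal_pairs(["嗯今天呃天气很好", "嗯"], toy_rewrite)
```

Passing the rewriter and the filter in as callables keeps the sketch independent of any particular LLM API, which the source does not specify.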

Entities

Institutions

  • arXiv

Sources