FormalASR: End-to-End Spoken Chinese to Formal Text

ai-technology · 2026-05-20

FormalASR is a pair of compact end-to-end models (0.6B and 1.7B parameters) that transcribe spoken Chinese directly into formal written text, bypassing the traditional two-stage pipeline in which an ASR model produces a verbatim transcript and an LLM then rewrites it. The models are fine-tuned from Qwen3-ASR with supervised fine-tuning on two newly constructed datasets, WenetSpeech-Formal and Speechio-Formal. Both datasets were built by using LLMs to rewrite informal speech transcripts into formal text and then applying quality filtering. In experiments, FormalASR achieves up to a 37.4% relative reduction in Character Error Rate (CER) compared to verbatim baselines. Because a single model replaces the two-stage pipeline, the approach reduces latency and memory costs, making it suitable for on-device deployment. The research is published on arXiv under the identifier 2605.19266.
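To make the headline metric concrete, the following is a minimal sketch of how CER and a relative CER reduction are commonly computed. It assumes CER is the character-level edit distance divided by the reference length, and that both systems are scored against the same formal reference text. The function names and the toy sentences are illustrative and do not come from the paper.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion from ref
                            curr[j - 1] + 1,      # insertion into ref
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character Error Rate: edit distance normalised by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_cer_reduction(baseline_cer: float, system_cer: float) -> float:
    """Fraction of the baseline's CER that the system removes."""
    if baseline_cer <= 0:
        raise ValueError("baseline CER must be positive")
    return (baseline_cer - system_cer) / baseline_cer


if __name__ == "__main__":
    # Toy example (not data from the paper): a formal reference, a verbatim
    # hypothesis that keeps spoken fillers, and a formal hypothesis with one error.
    reference = "我们明天开会"
    verbatim_hyp = "嗯我们那个明天开会"
    formal_hyp = "我们明天开回"

    baseline = cer(reference, verbatim_hyp)
    system = cer(reference, formal_hyp)
    print(f"baseline CER: {baseline:.3f}")
    print(f"system CER:   {system:.3f}")
    print(f"relative reduction: {relative_cer_reduction(baseline, system):.1%}")
```

Note that a relative reduction is a ratio of error rates, so a 37.4% relative reduction means the remaining CER is 62.6% of the baseline's, not that CER dropped by 37.4 percentage points.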

Key facts

FormalASR is an end-to-end model for spoken Chinese to formal text.
Two model sizes: 0.6B and 1.7B parameters.
Fine-tuned from Qwen3-ASR.
Two new datasets: WenetSpeech-Formal and Speechio-Formal.
Datasets built via LLM rewriting and quality filtering.
Up to 37.4% relative CER reduction over verbatim baselines.
Aims to reduce latency and memory for on-device deployment.
Published on arXiv:2605.19266.
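
The dataset construction described above (LLM rewriting followed by quality filtering) can be sketched as a rewrite-then-filter loop. This is only an illustration under stated assumptions: the summary does not specify the prompts, the LLM, or the filtering criteria, so `rewrite_fn`, the filler list, and the length-ratio thresholds below are hypothetical placeholders rather than the authors' method.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple

# Hypothetical filler characters used only for this toy filter.
FILLERS = ("嗯", "呃")


@dataclass
class FormalPair:
    audio_id: str
    verbatim: str
    formal: str


def passes_filter(verbatim: str, formal: str,
                  min_ratio: float = 0.5, max_ratio: float = 1.2) -> bool:
    """Hypothetical quality checks: non-empty output, plausible length, no fillers."""
    if not formal.strip():
        return False
    ratio = len(formal) / max(len(verbatim), 1)
    if not (min_ratio <= ratio <= max_ratio):
        return False
    return not any(f in formal for f in FILLERS)


def build_formal_dataset(transcripts: Iterable[Tuple[str, str]],
                         rewrite_fn: Callable[[str], str]) -> List[FormalPair]:
    """Rewrite each (audio_id, verbatim) transcript and keep pairs that pass the filter."""
    kept = []
    for audio_id, verbatim in transcripts:
        formal = rewrite_fn(verbatim)  # stands in for an LLM rewriting call
        if passes_filter(verbatim, formal):
            kept.append(FormalPair(audio_id, verbatim, formal))
    return kept


def toy_rewrite(text: str) -> str:
    """Stand-in for an LLM: simply strips the filler characters."""
    for f in FILLERS:
        text = text.replace(f, "")
    return text


if __name__ == "__main__":
    data = [("utt1", "嗯我们呃明天开会"), ("utt2", "呃")]
    for pair in build_formal_dataset(data, toy_rewrite):
        print(pair.audio_id, pair.formal)
```

In a real pipeline the audio would be paired with the filtered formal text as the training target, so the fine-tuned model learns to emit formal text directly from speech.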
