Medical CoT Distillation Improves Answers but Worsens Reasoning

other · 2026-05-28

A recent arXiv study reports that chain-of-thought (CoT) distillation for medical question answering can raise final-answer accuracy while making the reasoning traces less factual. The authors distilled a Qwen3-8B student from a DeepSeek-V3-family teacher and evaluated the student on MedQA-USMLE. The student's accuracy under self-consistency with 64 samples (SC@64) rose from 74.7% to 84.4%, and its calibration improved, with expected calibration error (ECE) falling from 0.096 to 0.034. In the same comparison, an audit of the reasoning steps by a style-blind LLM judge (Kimi-K2.6) found that the error rate on non-abstained steps rose from 30.6% to 50.3%. This divergence between answer quality and trace factuality held across evaluators, student scales, teacher strengths, medical benchmarks, and controls, which challenges the assumption that distillation improves both.
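The three metrics quoted above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's evaluation code: the function names, the use of vote share as the confidence score, the 10 equal-width ECE bins, and the "ok" / "error" / "abstain" verdict labels are all hypothetical choices made here for clarity.

```python
from collections import Counter


def self_consistency(answers):
    """Majority-vote answer and its vote share for one question.

    `answers` holds the final answers of k sampled reasoning traces
    (k = 64 for SC@64). Using the vote share as a confidence score is
    an assumption of this sketch.
    """
    counts = Counter(answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)


def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard binned ECE: weighted mean of |accuracy - confidence| per bin.

    Equal-width bins over [0, 1]; the bin count is an assumption.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / n * abs(accuracy - avg_conf)
    return ece


def step_error_rate(verdicts):
    """Share of erroneous steps among steps the judge did not abstain on.

    `verdicts` is a list of hypothetical judge labels: "ok", "error",
    or "abstain". Returns None if every step was abstained.
    """
    judged = [v for v in verdicts if v != "abstain"]
    if not judged:
        return None
    return sum(v == "error" for v in judged) / len(judged)
```

Because the step error rate is computed only over judged steps, abstentions shrink the denominator rather than counting as correct, so a judge that abstains more often does not by itself lower the reported rate.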

Key facts

Qwen3-8B student distilled from DeepSeek-V3-family teacher
MedQA-USMLE accuracy SC@64 improved from 74.7% to 84.4%
ECE improved from 0.096 to 0.034
Error rate on non-abstained steps rose from 30.6% to 50.3%
Audit performed by a style-blind LLM judge (Kimi-K2.6)
Pattern held across evaluators, student scales, teacher strengths, and benchmarks
Study conducted on medical QA benchmarks
arXiv paper ID: 2605.28301
