ARTFEED — Contemporary Art Intelligence

Medical CoT Distillation Improves Answers but Worsens Reasoning

other · 2026-05-28

A recent arXiv paper reports that chain-of-thought (CoT) distillation for medical question answering can raise final-answer accuracy while making the reasoning traces less factual. The authors distilled a Qwen3-8B student model from a DeepSeek-V3-family teacher and evaluated the student on MedQA-USMLE. Answer accuracy rose from 74.7% to 84.4% (SC@64), and calibration improved, with expected calibration error (ECE) falling from 0.096 to 0.034. However, an audit by a style-blind LLM judge (Kimi-K2.6) found that the error rate on non-abstained reasoning steps rose from 30.6% to 50.3%. This divergence between answer quality and trace factuality held across evaluators, student scales, teacher strengths, medical benchmarks, and controls, challenging the assumption that distillation improves both.
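
For readers unfamiliar with the two answer-level metrics quoted above, the following is a minimal sketch of how a self-consistency vote and a binned ECE are commonly computed. It is not the paper's code: the function names, the use of the vote share as the confidence score, and the 10 equal-width bins are all assumptions made for illustration, and SC@64 is read here as a majority vote over 64 sampled answers.

```python
from collections import Counter


def self_consistency_vote(sampled_answers):
    """Majority vote over sampled answers (hypothetical helper).

    Returns the most frequent answer and its vote share. Using the vote
    share as a confidence score is an assumption, not the paper's method.
    Ties resolve to the answer that was seen first.
    """
    counts = Counter(sampled_answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(sampled_answers)


def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean of |accuracy - mean confidence| per bin.

    Uses equal-width bins on [0, 1]; the bin count is an assumption.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for items in bins:
        if not items:
            continue
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += (len(items) / n) * abs(accuracy - mean_conf)
    return ece
```

In this sketch, each question would contribute one (vote share, correct or not) pair, and the ECE is computed over all questions in the benchmark. A lower ECE means the vote share tracks the actual hit rate more closely; it says nothing about whether the individual reasoning steps are factual, which is the gap the audit measures.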

Key facts

  • Qwen3-8B student distilled from DeepSeek-V3-family teacher
  • MedQA-USMLE accuracy (SC@64) improved from 74.7% to 84.4%
  • ECE improved from 0.096 to 0.034
  • Error rate on non-abstained steps rose from 30.6% to 50.3%
  • Audit performed by a style-blind LLM judge (Kimi-K2.6)
  • Pattern held across multiple evaluators and model scales
  • Study conducted on medical QA benchmarks
  • arXiv paper ID: 2605.28301

Entities

Institutions

  • arXiv