On-Device LLMs Benchmarked for Clinical Decision Support
A recent study benchmarks on-device large language models (LLMs) for clinical decision support, contrasting open-source models with proprietary alternatives. It evaluates a range of models, including gpt-oss at 20B and 120B parameters, Qwen3.5 variants from 9B to 35B, and Gemma 4 at 31B, across three clinical tasks: general disease diagnosis, ophthalmology, and expert grading simulation. Performance is compared against proprietary models such as GPT-5.1 and Gemini 3.1 Pro. The study also discusses the potential of gpt-oss-20B and Qwen3.5-35B for general diagnostics in under-resourced clinics.
Key facts
- Study benchmarks on-device LLMs for clinical decision support
- Models from gpt-oss, Qwen3.5, and Gemma 4 families evaluated
- Three clinical tasks: general diagnosis, ophthalmology, expert grading simulation
- Comparison with GPT-5.1, GPT-5-mini, Gemini 3.1 Pro, and DeepSeek-R1
- Fine-tuning of gpt-oss-20B and Qwen3.5-35B on general diagnostic data
- Addresses privacy and cloud infrastructure concerns
- Aims to enable local inference in resource-constrained settings
- Published on arXiv with ID 2601.03266
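The feasibility of local inference in resource-constrained settings hinges largely on memory. A back-of-the-envelope sketch of weight-storage requirements for the study's model sizes (the precision and bytes-per-parameter figures below are standard quantization assumptions, not numbers from the paper):

```python
def model_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate weight-storage footprint in GiB.

    Ignores KV cache and activation memory, which add further overhead
    at inference time.
    """
    return params_billion * 1e9 * bytes_per_param / 2**30

# Common precisions and their bytes per parameter (standard assumptions).
precisions = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

# Parameter counts taken from the models named in the study.
for name, size_b in [("gpt-oss-20B", 20), ("Qwen3.5-35B", 35), ("gpt-oss-120B", 120)]:
    footprints = {p: round(model_memory_gb(size_b, b), 1) for p, b in precisions.items()}
    print(name, footprints)
```

Under these assumptions, a 20B model needs roughly 37 GiB at fp16 but under 10 GiB at 4-bit quantization, which is what brings such models within reach of a single consumer GPU or workstation in a clinic without cloud infrastructure.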