On-Device LLMs Benchmarked for Clinical Decision Support
A recent study benchmarks on-device large language models (LLMs) for clinical decision support, contrasting open-source models with proprietary alternatives. It evaluates a range of models, including gpt-oss at 20B and 120B parameters, Qwen3.5 variants from 9B to 35B, and Gemma 4 at 31B, across three clinical tasks: general disease diagnosis, ophthalmology, and expert grading simulation. Performance is compared against proprietary models such as GPT-5.1 and Gemini 3.1 Pro. The study also discusses the potential of gpt-oss-20B and Qwen3.5-35B for general diagnostics in under-resourced clinics.
Key facts
- Study benchmarks on-device LLMs for clinical decision support
- Models from gpt-oss, Qwen3.5, and Gemma 4 families evaluated
- Three clinical tasks: general diagnosis, ophthalmology, expert grading simulation
- Comparison with GPT-5.1, GPT-5-mini, Gemini 3.1 Pro, and DeepSeek-R1
- Fine-tuning of gpt-oss-20B and Qwen3.5-35B on general diagnostic data
- Addresses privacy and cloud infrastructure concerns
- Aims to enable local inference in resource-constrained settings
- Published on arXiv with ID 2601.03266
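The feasibility of local inference in resource-constrained settings hinges largely on memory. A back-of-the-envelope sketch of weight-storage requirements for the study's model sizes (the precision and bytes-per-parameter figures below are standard quantization assumptions, not numbers from the paper):

```python
def model_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate weight-storage footprint in GiB.

    Ignores KV cache and activation memory, which add further overhead
    at inference time.
    """
    return params_billion * 1e9 * bytes_per_param / 2**30

# Common precisions and their bytes per parameter (standard assumptions).
precisions = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

# Parameter counts taken from the models named in the study.
for name, size_b in [("gpt-oss-20B", 20), ("Qwen3.5-35B", 35), ("gpt-oss-120B", 120)]:
    footprints = {p: round(model_memory_gb(size_b, b), 1) for p, b in precisions.items()}
    print(name, footprints)
```

Under these assumptions, a 20B model needs roughly 37 GiB at fp16 but under 10 GiB at 4-bit quantization, which is what brings such models within reach of a single consumer GPU or workstation in a clinic without cloud infrastructure.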