AI Models Show Divergent Consistency in Generating Exercise Prescriptions
A recent study posted to arXiv evaluated how consistently three large language models generate exercise prescriptions: GPT-4.1, Claude Sonnet 4.6, and Gemini 2.5 Flash. Each model produced prescriptions for six clinical scenarios 20 times at temperature=0, yielding 360 outputs in total. The analysis covered four dimensions: semantic similarity, output reproducibility, FITT classification (Frequency, Intensity, Time, Type), and safety expression. GPT-4.1 achieved the highest mean semantic similarity (0.955), followed by Gemini 2.5 Flash (0.950) and Claude Sonnet 4.6 (0.903), with significant differences between models (H = 458.41, p < .001). The reproducibility analysis, however, shows that these similar scores mask very different behavior: GPT-4.1's outputs were 100% textually unique, so its high similarity reflects semantically stable but reworded content, whereas only 27.5% of Gemini 2.5 Flash's outputs were unique, meaning its score was inflated by verbatim duplication. The findings suggest that semantic similarity metrics alone may not adequately distinguish these differences in model behavior. The study appears as arXiv preprint 2604.19598v1 (announcement type: cross). A sketch of one common way to compute such a similarity score follows below.
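The paper's exact similarity pipeline is not reproduced here, but a typical way to obtain a mean semantic similarity score like those above is to embed each repeated generation and average cosine similarity over all pairs. The minimal sketch below assumes the sentence-transformers library and the all-MiniLM-L6-v2 model; both are illustrative choices, not the study's actual setup.

```python
# Minimal sketch: mean pairwise semantic similarity over repeated outputs.
# Assumes the sentence-transformers library; the embedding model and the
# similarity definition here are placeholders, not the paper's method.
from itertools import combinations

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("all-MiniLM-L6-v2")  # hypothetical choice

def mean_pairwise_similarity(outputs: list[str]) -> float:
    """Mean cosine similarity over all pairs of generated prescriptions."""
    embeddings = model.encode(outputs, convert_to_tensor=True)
    pairs = combinations(range(len(outputs)), 2)
    sims = [cos_sim(embeddings[i], embeddings[j]).item() for i, j in pairs]
    return sum(sims) / len(sims)

# outputs = the 20 prescriptions one model generated for one scenario
# print(mean_pairwise_similarity(outputs))
```

A score near 1.0 from this procedure can arise either from semantically stable rewording or from verbatim repetition, which is exactly the ambiguity the study's uniqueness analysis exposes.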
Key facts
- Study compared exercise prescription consistency across three LLMs: GPT-4.1, Claude Sonnet 4.6, and Gemini 2.5 Flash
- Each model generated prescriptions for six clinical scenarios 20 times under temperature=0 conditions
- 360 total outputs were analyzed across four dimensions: semantic similarity, output reproducibility, FITT classification, and safety expression
- GPT-4.1 had highest mean semantic similarity (0.955), followed by Gemini 2.5 Flash (0.950) and Claude Sonnet 4.6 (0.903)
- Significant inter-model differences confirmed with H = 458.41, p < .001 (see the sketch after this list)
- GPT-4.1 produced 100% unique outputs with stable semantic content
- Gemini 2.5 Flash showed only 27.5% unique outputs due to text repetition
- Study published as arXiv:2604.19598v1 (announcement type: cross)
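To make the reproducibility and significance figures concrete: the unique-output rate is simply the fraction of verbatim-distinct strings among a model's repeated generations, and an H statistic conventionally denotes a Kruskal-Wallis test. The sketch below assumes the test compares per-model similarity scores; the paper's exact grouping is not specified in this summary, and the sample values are placeholders.

```python
# Minimal sketch of the two consistency checks described above.
from scipy.stats import kruskal

def unique_output_rate(outputs: list[str]) -> float:
    """Fraction of generations that are verbatim-distinct strings.

    100% (GPT-4.1) means no two outputs were identical text;
    27.5% (Gemini 2.5 Flash) means heavy verbatim repetition.
    """
    return len(set(outputs)) / len(outputs)

# Toy per-model similarity samples; NOT the study's data.
gpt_scores = [0.96, 0.95, 0.96, 0.95, 0.96]
claude_scores = [0.90, 0.91, 0.89, 0.91, 0.90]
gemini_scores = [0.95, 0.95, 0.94, 0.96, 0.95]

# A Kruskal-Wallis test compares the groups without assuming normality;
# the paper reports H = 458.41, p < .001 for the real data.
h_stat, p_value = kruskal(gpt_scores, claude_scores, gemini_scores)
print(f"H = {h_stat:.2f}, p = {p_value:.4f}")
```

Pairing the two checks is the study's key methodological point: a high mean similarity with a low unique-output rate signals duplication rather than genuinely consistent reasoning.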