AI Models Show Divergent Consistency in Generating Exercise Prescriptions
A recent study posted to arXiv evaluated how consistently three large language models generate exercise prescriptions: GPT-4.1, Claude Sonnet 4.6, and Gemini 2.5 Flash. Each model produced prescriptions for six clinical scenarios 20 times at temperature=0, yielding 360 outputs in total. The analysis covered four dimensions: semantic similarity, output reproducibility, FITT classification (Frequency, Intensity, Time, Type), and safety expression. GPT-4.1 achieved the highest mean semantic similarity (0.955), followed by Gemini 2.5 Flash (0.950) and Claude Sonnet 4.6 (0.903), with significant differences between models (H = 458.41, p < .001). The reproducibility analysis, however, shows that these similar scores mask very different behavior: GPT-4.1's outputs were 100% textually unique, so its high similarity reflects semantically stable but reworded content, whereas only 27.5% of Gemini 2.5 Flash's outputs were unique, meaning its score was inflated by verbatim duplication. The findings suggest that semantic similarity metrics alone may not adequately distinguish these differences in model behavior. The study appears as arXiv preprint 2604.19598v1 (announcement type: cross). A sketch of one common way to compute such a similarity score follows below.
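The paper's exact similarity pipeline is not reproduced here, but a typical way to obtain a mean semantic similarity score like those above is to embed each repeated generation and average cosine similarity over all pairs. The minimal sketch below assumes the sentence-transformers library and the all-MiniLM-L6-v2 model; both are illustrative choices, not the study's actual setup.

```python
# Minimal sketch: mean pairwise semantic similarity over repeated outputs.
# Assumes the sentence-transformers library; the embedding model and the
# similarity definition here are placeholders, not the paper's method.
from itertools import combinations

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("all-MiniLM-L6-v2")  # hypothetical choice

def mean_pairwise_similarity(outputs: list[str]) -> float:
    """Mean cosine similarity over all pairs of generated prescriptions."""
    embeddings = model.encode(outputs, convert_to_tensor=True)
    pairs = combinations(range(len(outputs)), 2)
    sims = [cos_sim(embeddings[i], embeddings[j]).item() for i, j in pairs]
    return sum(sims) / len(sims)

# outputs = the 20 prescriptions one model generated for one scenario
# print(mean_pairwise_similarity(outputs))
```

A score near 1.0 from this procedure can arise either from semantically stable rewording or from verbatim repetition, which is exactly the ambiguity the study's uniqueness analysis exposes.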
Key facts
- Study compared exercise prescription consistency across three LLMs: GPT-4.1, Claude Sonnet 4.6, and Gemini 2.5 Flash
- Each model generated prescriptions for six clinical scenarios 20 times under temperature=0 conditions
- 360 total outputs were analyzed across four dimensions: semantic similarity, output reproducibility, FITT classification, and safety expression
- GPT-4.1 had highest mean semantic similarity (0.955), followed by Gemini 2.5 Flash (0.950) and Claude Sonnet 4.6 (0.903)
- Significant inter-model differences confirmed with H = 458.41, p < .001 (see the sketch after this list)
- GPT-4.1 produced 100% unique outputs with stable semantic content
- Gemini 2.5 Flash showed only 27.5% unique outputs due to text repetition
- Study published as arXiv:2604.19598v1 (announcement type: cross)
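To make the reproducibility and significance figures concrete: the unique-output rate is simply the fraction of verbatim-distinct strings among a model's repeated generations, and an H statistic conventionally denotes a Kruskal-Wallis test. The sketch below assumes the test compares per-model similarity scores; the paper's exact grouping is not specified in this summary, and the sample values are placeholders.

```python
# Minimal sketch of the two consistency checks described above.
from scipy.stats import kruskal

def unique_output_rate(outputs: list[str]) -> float:
    """Fraction of generations that are verbatim-distinct strings.

    100% (GPT-4.1) means no two outputs were identical text;
    27.5% (Gemini 2.5 Flash) means heavy verbatim repetition.
    """
    return len(set(outputs)) / len(outputs)

# Toy per-model similarity samples; NOT the study's data.
gpt_scores = [0.96, 0.95, 0.96, 0.95, 0.96]
claude_scores = [0.90, 0.91, 0.89, 0.91, 0.90]
gemini_scores = [0.95, 0.95, 0.94, 0.96, 0.95]

# A Kruskal-Wallis test compares the groups without assuming normality;
# the paper reports H = 458.41, p < .001 for the real data.
h_stat, p_value = kruskal(gpt_scores, claude_scores, gemini_scores)
print(f"H = {h_stat:.2f}, p = {p_value:.4f}")
```

Pairing the two checks is the study's key methodological point: a high mean similarity with a low unique-output rate signals duplication rather than genuinely consistent reasoning.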