AI Chatbots Provide Inaccurate Medical Information Despite Authoritative Tone, Study Reveals
Researchers at the University of Tübingen assessed five AI chatbots (ChatGPT, Gemini, Grok, Meta AI, and DeepSeek) and found notable inaccuracies in their health-related answers. Across 50 medical questions, nearly 20% of responses were rated highly problematic, 50% problematic, and 30% somewhat problematic. Grok performed worst, with 58% of its responses rated problematic, followed by ChatGPT at 52% and Meta AI at 50%. The chatbots struggled most with questions about nutrition and athletic performance, and open-ended queries proved especially difficult, with 32% of them rated highly problematic. The study, published in BMJ Open, reported a median completeness score of just 40% for the scientific references the chatbots provided, and its authors urge users to verify health information independently.
Key facts
- Five AI chatbots were tested: ChatGPT, Gemini, Grok, Meta AI, and DeepSeek
- Researchers asked 50 health questions across five medical domains
- Two experts independently rated all answers
- Nearly 20% of answers were highly problematic, 50% problematic, 30% somewhat problematic
- Only two questions out of 250 were refused by the chatbots
- Grok performed worst with 58% problematic responses
- Chatbots achieved median reference completeness score of just 40%
- Study published in BMJ Open using February 2025 free versions
Entities
People
- Carsten Eickhoff
Institutions
- University of Tübingen
- BMJ Open
- Nature Medicine
- JAMA Network Open
- Nature Communications Medicine
- The Conversation
Locations
- Tübingen
- Germany