AI Carb Counting Fails Reproducibility Test in 27,000-Query Study
A recent preprint indicates that leading AI models struggle to estimate carbohydrate content from food photos, with large discrepancies across repeated assessments. An unnamed researcher analyzed 13 food images with four models—OpenAI GPT-5.4, Anthropic Claude Sonnet 4.6, Google Gemini 2.5 Pro, and Google Gemini 3.1 Pro Preview—running more than 500 queries per image for each model, 26,904 queries in total. The estimates were inconsistent: Gemini 2.5 Pro's predictions for a paella ranged from 55g to 484g. Claude exhibited the least variation, while the Gemini models' estimates frequently varied by more than 10-20%. Food identification errors appeared in 8 of the 13 images. The study, which highlights both systematic bias and stochastic variability, recommends averaging 3-5 queries per image for improved accuracy and will be submitted to Diabetologia for peer review.
Key facts
- 26,904 queries were made across 13 food photos to 4 AI models
- Models tested: OpenAI GPT-5.4, Anthropic Claude Sonnet 4.6, Google Gemini 2.5 Pro, Google Gemini 3.1 Pro Preview
- Gemini 2.5 Pro's paella estimates ranged from 55g to 484g (429g range)
- Claude showed the least variation but was consistently low on a cheese sandwich (28g vs the actual 40g)
- GPT-5.4 averaged 74g for the cheese sandwich (34g over)
- Food identification errors occurred in 8 of 13 images
- Confidence scores were not correlated with accuracy
- The study is a preprint being submitted to Diabetologia
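The study's suggested mitigation—repeating each query 3-5 times and aggregating the answers—can be sketched as follows. This is a minimal illustration, not code from the study: `query_model` is a hypothetical callable standing in for a real vision-model API, and the choice of median as the aggregator is an assumption (the preprint only recommends multiple queries).

```python
import statistics

def estimate_carbs(query_model, image, n_queries=5):
    """Aggregate repeated model estimates to damp stochastic variability.

    query_model is a hypothetical callable that returns one carb estimate
    in grams for the given image. Repeating the query 3-5 times follows
    the study's recommendation; the median is an assumed aggregator,
    chosen because it resists a single wild outlier.
    """
    estimates = [query_model(image) for _ in range(n_queries)]
    return statistics.median(estimates)

# Usage with a stand-in model that returns scattered estimates,
# including one extreme outlier like the 484g paella guess:
fake_replies = iter([55, 110, 95, 484, 105])
print(estimate_carbs(lambda img: next(fake_replies), "paella.jpg"))  # → 105
```

Note that aggregation only addresses the stochastic component of the error; it cannot correct a systematic bias such as Claude's consistently low cheese-sandwich estimates.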
Entities
Institutions
- OpenAI
- Anthropic
- iAPS
- Diabetologia