AI Carb Counting Fails Reproducibility Test in 27,000-Query Study
A recent preprint indicates that leading AI models struggle to estimate carbohydrate content from food photos, with large discrepancies across repeated assessments. An unnamed researcher analyzed 13 food images with four models—OpenAI GPT-5.4, Anthropic Claude Sonnet 4.6, Google Gemini 2.5 Pro, and Google Gemini 3.1 Pro Preview—running more than 500 queries per image for each model, 26,904 queries in total. The estimates were inconsistent: Gemini 2.5 Pro's predictions for a paella ranged from 55g to 484g. Claude exhibited the least variation, while the Gemini models' estimates frequently varied by more than 10-20%. Food identification errors appeared in 8 of the 13 images. The study, which highlights both systematic bias and stochastic variability, recommends averaging 3-5 queries per image for improved accuracy and will be submitted to Diabetologia for peer review.
Key facts
- 26,904 queries were made across 13 food photos to 4 AI models
- Models tested: OpenAI GPT-5.4, Anthropic Claude Sonnet 4.6, Google Gemini 2.5 Pro, Google Gemini 3.1 Pro Preview
- Gemini 2.5 Pro's paella estimates ranged from 55g to 484g (429g range)
- Claude showed the least variation but was consistently low on a cheese sandwich (28g vs the actual 40g)
- GPT-5.4 averaged 74g for the cheese sandwich (34g over)
- Food identification errors occurred in 8 of 13 images
- Confidence scores were not correlated with accuracy
- The study is a preprint being submitted to Diabetologia
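The study's suggested mitigation—repeating each query 3-5 times and aggregating the answers—can be sketched as follows. This is a minimal illustration, not code from the study: `query_model` is a hypothetical callable standing in for a real vision-model API, and the choice of median as the aggregator is an assumption (the preprint only recommends multiple queries).

```python
import statistics

def estimate_carbs(query_model, image, n_queries=5):
    """Aggregate repeated model estimates to damp stochastic variability.

    query_model is a hypothetical callable that returns one carb estimate
    in grams for the given image. Repeating the query 3-5 times follows
    the study's recommendation; the median is an assumed aggregator,
    chosen because it resists a single wild outlier.
    """
    estimates = [query_model(image) for _ in range(n_queries)]
    return statistics.median(estimates)

# Usage with a stand-in model that returns scattered estimates,
# including one extreme outlier like the 484g paella guess:
fake_replies = iter([55, 110, 95, 484, 105])
print(estimate_carbs(lambda img: next(fake_replies), "paella.jpg"))  # → 105
```

Note that aggregation only addresses the stochastic component of the error; it cannot correct a systematic bias such as Claude's consistently low cheese-sandwich estimates.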
Entities
Institutions
- OpenAI
- Anthropic
- iAPS
- Diabetologia