PushupBench reveals VLMs struggle with repetition counting
Researchers have released PushupBench, a benchmark of 446 long-form video clips averaging 36.7 seconds, designed to test how well vision-language models (VLMs) count repetitive actions such as pushups. The best frontier model reaches only 42.1% exact accuracy, while open-source 4-billion-parameter models score around 6%, comparable to basic supervised baselines. The findings indicate that accuracy alone can be misleading, as weaker models tend to predict the modal count rather than reason over time. Fine-tuning on counting with just 1,000 samples also improves general video understanding, lifting MVBench by 2.15 points, PerceptionTest by 1.88, and TVBench by 4.54, which supports counting as a proxy for temporal reasoning. The benchmark is available online and integrated into lmms-eval.
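Since the summary states that PushupBench ships as an lmms-eval task, here is a minimal sketch of how such an evaluation might be launched from Python. It assumes lmms-eval is installed, that the task is registered under the hypothetical name `pushupbench`, and that a `qwen2_vl` model backend with a `Qwen/Qwen2-VL-7B-Instruct` checkpoint is available; the actual task and model identifiers should be checked against the lmms-eval registry.

```python
# Minimal sketch: run PushupBench through the lmms-eval CLI via subprocess.
# Assumptions (not confirmed by the source): the task id "pushupbench" and the
# "qwen2_vl" backend with a Qwen/Qwen2-VL-7B-Instruct checkpoint.
import subprocess

cmd = [
    "python", "-m", "lmms_eval",
    "--model", "qwen2_vl",
    "--model_args", "pretrained=Qwen/Qwen2-VL-7B-Instruct",
    "--tasks", "pushupbench",          # assumed task name for PushupBench
    "--batch_size", "1",
    "--log_samples",                   # keep per-clip predictions for error analysis
    "--output_path", "./logs/pushupbench",
]
subprocess.run(cmd, check=True)
```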
Key facts
- PushupBench contains 446 long-form clips averaging 36.7 seconds.
- Best frontier model achieves 42.1% exact accuracy.
- Open-source 4B models score ~6% exact accuracy.
- Weaker models exploit the modal count rather than temporal reasoning (see the sketch after this list).
- Fine-tuning on counting with 1k samples improves MVBench (+2.15), PerceptionTest (+1.88), TVBench (+4.54).
- Counting is proposed as a proxy for broader temporal video understanding.
- PushupBench is incorporated into lmms-eval and hosted online.
- The study falls under computer vision research within computer science.
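To make the modal-count point concrete, the following self-contained sketch uses synthetic counts (not PushupBench labels): a baseline that always predicts the most frequent training count earns non-trivial exact accuracy without looking at a single frame, which is why exact accuracy alone can flatter models that do no temporal reasoning and why error metrics are a useful complement.

```python
# Sketch of the "modal count" shortcut on hypothetical data.
from collections import Counter
import random

random.seed(0)

# Hypothetical pushup counts; real PushupBench labels are not reproduced here.
population = list(range(5, 26))
weights = [max(1, 12 - abs(c - 10)) for c in population]   # counts cluster near 10
train_counts = random.choices(population, weights=weights, k=400)
test_counts = random.choices(population, weights=weights, k=100)

# Modal-count baseline: always predict the most frequent count seen in training.
modal_count = Counter(train_counts).most_common(1)[0][0]
predictions = [modal_count] * len(test_counts)

exact_acc = sum(p == t for p, t in zip(predictions, test_counts)) / len(test_counts)
mae = sum(abs(p - t) for p, t in zip(predictions, test_counts)) / len(test_counts)

print(f"modal count: {modal_count}")
print(f"exact accuracy: {exact_acc:.1%}")  # non-zero without watching any video
print(f"mean absolute error: {mae:.2f}")   # an error metric exposes the shortcut
```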
Entities
- PushupBench (benchmark)
- lmms-eval (evaluation toolkit)
- arXiv (preprint host)