ARTFEED — Contemporary Art Intelligence

MLLMs Outperform CNNs in Seizure Video Analysis

ai-technology · 2026-05-07

A recent pilot study posted to arXiv investigates how well multimodal large language models (MLLMs) identify pathological movements in seizure videos in a zero-shot setting, without task-specific training. Evaluated on 90 clinical recordings against 20 semiological features defined by the International League Against Epilepsy (ILAE), the MLLMs surpassed fine-tuned CNN and ViT baselines on 13 of 18 features, performing strongly on prominent postural and contextual cues but struggling with subtle, rapid movements. Feature-targeted signal enhancements, such as facial cropping, pose estimation, and audio denoising, improved results on 10 of the 20 features. The study underscores the promise of MLLMs for automated analysis of seizure semiology while highlighting their current limits in detecting fine-grained motion.
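The zero-shot setup described above can be sketched roughly as follows. This is an illustrative assumption, not the study's actual protocol: the feature names, prompt wording, and the `query_mllm` stand-in are all hypothetical.

```python
# Hypothetical sketch of zero-shot semiology scoring with an MLLM.
# The feature names, prompt template, and answer parsing below are
# assumptions for illustration; the study's actual prompts and model
# interface are not described in this article.

def build_prompt(feature: str) -> str:
    """Ask the model a yes/no question about one semiological feature."""
    return (
        "You are shown frames from a clinical seizure recording. "
        f"Is the following semiological feature present: {feature}? "
        "Answer strictly 'yes' or 'no'."
    )

def parse_answer(text: str) -> bool:
    """Map a free-text model reply to a binary feature label."""
    return text.strip().lower().startswith("yes")

def score_video(frames, features, query_mllm) -> dict:
    """Query the MLLM once per feature.

    `query_mllm(frames, prompt)` is a stand-in for any multimodal
    chat API that accepts images plus text and returns a string.
    """
    return {
        feature: parse_answer(query_mllm(frames, build_prompt(feature)))
        for feature in features
    }
```

In practice `query_mllm` would wrap a real multimodal API call, and feature-targeted enhancements such as facial cropping or pose overlays would be applied to `frames` before querying.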

Key facts

  • MLLMs evaluated on 90 clinical seizure recordings
  • 20 ILAE-defined semiological features assessed
  • Zero-shot performance compared to fine-tuned CNN and ViT
  • MLLMs outperformed baselines on 13 of 18 features
  • Signal enhancement improved performance on 10 of 20 features
  • Study published on arXiv with ID 2605.03352

Entities

Institutions

  • arXiv
