Instruction-Tuned MLLMs Show Brain Alignment During Movie Watching
A study posted to arXiv (2506.08277) investigates whether instruction-tuned multimodal large language models (IT-MLLMs) align with brain activity during naturalistic movie watching. The researchers predicted fMRI responses from the representations of six video and two audio IT-MLLMs across 13 video task instructions, and found that instruction tuning helps organize representations around functional task demands rather than surface semantics. The work addresses a gap in prior evaluations, which focused on unimodal stimuli or non-instruction-tuned models.
Key facts
- Study posted to arXiv with ID 2506.08277
- Investigates instruction-tuned multimodal large language models (IT-MLLMs)
- Uses fMRI responses recorded during naturalistic movie watching (video with audio)
- Tests six video and two audio IT-MLLMs
- Includes 13 video task instructions
- Finds instruction-tuning organizes representations around functional task demands
- Prior work focused on unimodal stimuli or non-instruction-tuned models
- Study addresses brain alignment under multimodal naturalistic stimuli
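Predicting fMRI responses from model representations is commonly done with a voxel-wise linear encoding model. The sketch below illustrates that general idea only. It is not the study's pipeline: the summary does not say which regression method, features, or scoring the authors used, so the ridge regression, the Pearson-correlation score, the function names (`fit_ridge`, `voxelwise_pearson`, `synthetic_alignment`), and all the data here are assumptions, with random arrays standing in for model embeddings and brain recordings.

```python
import numpy as np

def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression: W = (X^T X + alpha * I)^-1 X^T Y."""
    n_features = X.shape[1]
    gram = X.T @ X + alpha * np.eye(n_features)
    return np.linalg.solve(gram, X.T @ Y)

def voxelwise_pearson(Y_true, Y_pred):
    """Pearson correlation between measured and predicted responses, per voxel."""
    yt = Y_true - Y_true.mean(axis=0)
    yp = Y_pred - Y_pred.mean(axis=0)
    denom = np.linalg.norm(yt, axis=0) * np.linalg.norm(yp, axis=0)
    return (yt * yp).sum(axis=0) / denom

def synthetic_alignment(n_train=200, n_test=50, n_features=16, n_voxels=5, seed=0):
    """Fit on synthetic 'embeddings' and score held-out 'voxels' (assumed setup)."""
    rng = np.random.default_rng(seed)
    # Hypothetical stand-in for per-timepoint model embeddings.
    X = rng.standard_normal((n_train + n_test, n_features))
    # Hypothetical stand-in for fMRI responses: linear in X plus noise.
    W_true = rng.standard_normal((n_features, n_voxels))
    Y = X @ W_true + 0.5 * rng.standard_normal((n_train + n_test, n_voxels))

    X_train, X_test = X[:n_train], X[n_train:]
    Y_train, Y_test = Y[:n_train], Y[n_train:]

    W = fit_ridge(X_train, Y_train, alpha=1.0)
    # One held-out correlation per voxel serves as the alignment score.
    return voxelwise_pearson(Y_test, X_test @ W)

scores = synthetic_alignment()
print(scores.shape)
```

In a setup like this, one could fit a separate encoding model for the embeddings produced under each task instruction and compare the per-voxel scores, which is one plausible way to ask which instructions best predict which brain regions.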