Vision language models quantify the human visual exposome's impact on mental health
A team of researchers has used vision language models (VLMs) to measure the semantic depth of human visual experience, addressing a crucial gap in understanding how visual surroundings affect mental health. Combining ecological momentary assessment (EMA) with VLMs, they analyzed 2,674 participant-submitted photographs. Predictions of momentary affect and chronic stress based on VLM-derived greenness estimates aligned well with established benchmarks. The team also built a semi-autonomous large language model (LLM) pipeline that mined over seven million scientific articles and extracted nearly 1,000 environmental features linked to mental health. Applied to the real-world images, up to 33% of the VLM-derived context ratings significantly predicted mental health outcomes. The study, available on arXiv (ID: 2605.03863), offers a novel way to capture the first-person visual context of everyday life, going beyond coarse geospatial proxies and bias-prone self-reports.
Key facts
- Study uses VLMs to quantify visual exposome
- 2,674 participant-generated photographs analyzed
- Greenness estimates predict momentary affect and chronic stress (modeling sketch after this list)
- LLM pipeline mined over 7 million scientific publications
- Nearly 1,000 environmental features extracted
- Up to 33% of VLM context ratings significantly predict mental health
- Published on arXiv with ID 2605.03863
- Method captures first-person visual context of daily life