Vision language models quantify the human visual exposome's impact on mental health
A team of researchers has used vision language models (VLMs) to measure the semantic depth of human visual experience, addressing a crucial gap in understanding how visual surroundings affect mental health. Combining ecological momentary assessment (EMA) with VLMs, they analyzed 2,674 participant-submitted photographs. Predictions of momentary affect and chronic stress based on VLM-derived greenness estimates aligned well with established benchmarks. The team also built a semi-autonomous large language model (LLM) pipeline that mined over seven million scientific articles and extracted nearly 1,000 environmental features linked to mental health. Applied to the real-world images, up to 33% of the VLM-derived context ratings significantly predicted mental health outcomes. The study, available on arXiv (ID: 2605.03863), offers a novel way to capture the first-person visual context of everyday life, going beyond coarse geospatial proxies and bias-prone self-reports.
Key facts
- Study uses VLMs to quantify visual exposome
- 2,674 participant-generated photographs analyzed
- Greenness estimates predict momentary affect and chronic stress (modeling sketch after this list)
- LLM pipeline mined over 7 million scientific publications
- Nearly 1,000 environmental features extracted
- Up to 33% of VLM context ratings significantly predict mental health
- Published on arXiv with ID 2605.03863
- Method captures first-person visual context of daily life