AI Framework Uses Street View to Assess Building Conditions Nationwide
A team of researchers has developed a framework that combines multimodal large language models (LLMs) with Google Street View (GSV) imagery to automatically assess building conditions across the United States. By fine-tuning Gemma 3 27B on a small human-labeled dataset, they achieved strong correlation with human mean opinion scores (MOS), surpassing individual raters on the Spearman rank correlation coefficient (SRCC) and the Pearson linear correlation coefficient (PLCC). To improve efficiency, knowledge distillation was used to transfer this capability to the smaller Gemma 3 4B, which performed comparably while running about three times faster. Further distillation into a CNN-based EfficientNetV2-M and a transformer-based SwinV2-B preserved comparable performance with a roughly 30x speedup. The work also examines how well LLM judgments align with human assessments of housing and built-environment attributes, and the team built a visualization tool for the findings.
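As an illustration of how agreement with the human mean opinion scores might be quantified, the sketch below computes SRCC and PLCC with SciPy. The score arrays are hypothetical placeholders, not data from the study.

```python
# Minimal sketch: comparing model-predicted condition scores against human
# mean opinion scores (MOS) using SRCC and PLCC. The values below are
# illustrative placeholders, not results from the paper.
import numpy as np
from scipy.stats import spearmanr, pearsonr

# Hypothetical MOS for a handful of street-view images (e.g., 1 = poor, 5 = excellent)
mos = np.array([4.2, 2.8, 3.5, 1.9, 4.7, 3.1])

# Hypothetical model predictions for the same images
pred = np.array([4.0, 3.0, 3.3, 2.2, 4.5, 3.4])

srcc, _ = spearmanr(pred, mos)   # rank-order agreement
plcc, _ = pearsonr(pred, mos)    # linear agreement

print(f"SRCC: {srcc:.3f}, PLCC: {plcc:.3f}")
```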
Key facts
- Framework uses multimodal LLMs and Google Street View imagery
- Fine-tuned Gemma 3 27B on human-labeled dataset
- Outperforms individual human raters on SRCC and PLCC against the MOS benchmark
- Knowledge distillation to Gemma 3 4B achieves 3x speedup
- Further distillation to EfficientNetV2-M and SwinV2-B achieves 30x speed gain (see the distillation sketch after this list)
- Human-AI alignment study assesses built environment and housing attributes
- Visualization tool developed for results
- Published on arXiv under ID 2604.21102
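The summary does not detail the distillation procedure; a common formulation is to train the smaller model to regress the teacher's scores directly. The sketch below shows that pattern with an EfficientNetV2-M student from timm; the data loader, model name, and hyperparameters are assumptions, not the authors' setup.

```python
# Minimal sketch of score distillation into a lightweight CNN: a student
# regressor (EfficientNetV2-M via timm) learns to reproduce condition scores
# produced by a larger teacher model. Dataset, teacher scores, and
# hyperparameters are hypothetical.
import torch
import torch.nn as nn
import timm
from torch.utils.data import DataLoader

# Student: EfficientNetV2-M with a single regression output (condition score)
student = timm.create_model("tf_efficientnetv2_m", pretrained=True, num_classes=1)
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

def distill_epoch(loader: DataLoader) -> float:
    """One epoch of regression distillation; `loader` is assumed to yield
    (image_batch, teacher_score_batch) pairs precomputed with the teacher LLM."""
    student.train()
    total = 0.0
    for images, teacher_scores in loader:
        optimizer.zero_grad()
        preds = student(images).squeeze(-1)            # (B,) predicted scores
        loss = loss_fn(preds, teacher_scores.float())  # match the teacher's scores
        loss.backward()
        optimizer.step()
        total += loss.item()
    return total / max(len(loader), 1)
```

The same loop would apply to a SwinV2-B student by swapping the model name; only the backbone changes, not the distillation objective.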
Entities
Institutions
- arXiv
Locations
- United States