AI Model Estimates Object Mass from Single RGB Image

ai-technology · 2026-05-07

Researchers have developed a physically structured framework for estimating an object's mass from a single RGB image. Mass prediction from pixels is ill-posed because mass depends on both volume and density, neither of which can be read directly from the image. The framework addresses this by aligning visual cues with those two physical factors. It uses monocular depth estimation to recover 3D geometry, which informs volume, and a vision-language model to extract material semantics, which inform density. An instance-adaptive gating mechanism then fuses the geometry, semantic, and appearance representations into physically guided latent factors for volume and density. The work is available on arXiv (2601.20303) and is a step toward physically meaningful AI perception.
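The article does not give the paper's equations, so what follows is only a minimal sketch of how instance-adaptive gated fusion and a volume-times-density factorization can fit together. Everything in it is an assumption for illustration: the function names (`gated_fusion`, `predict_mass`), the softmax gates computed from a linear score over the concatenated features, and the linear heads that read log-volume and log-density from the fused vector. Because the gate scores are computed from the input features, each instance gets its own mixing weights, and predicting mass as exp(log V + log ρ) keeps the output tied to the identity m = ρV.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a short list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("length mismatch")
    return sum(x * y for x, y in zip(a, b))


def gated_fusion(
    geometry: Sequence[float],
    semantic: Sequence[float],
    appearance: Sequence[float],
    gate_weights: Sequence[Sequence[float]],
) -> Tuple[List[float], List[float]]:
    """Mix three same-length branch features with per-instance softmax gates.

    gate_weights holds one hypothetical scoring vector per branch. Each is
    applied to the concatenated features, so the gates change with the input.
    """
    branches = (geometry, semantic, appearance)
    if len({len(b) for b in branches}) != 1 or len(gate_weights) != 3:
        raise ValueError("expected three equal-length branches and three gate vectors")
    context = [*geometry, *semantic, *appearance]  # concatenated instance features
    gates = softmax([dot(w, context) for w in gate_weights])
    # Weighted sum of the branches, element by element.
    fused = [
        sum(g * branch[i] for g, branch in zip(gates, branches))
        for i in range(len(geometry))
    ]
    return fused, gates


def predict_mass(
    fused: Sequence[float],
    w_log_volume: Sequence[float],
    w_log_density: Sequence[float],
) -> float:
    """Hypothetical linear heads for log-volume and log-density; mass = V * rho."""
    log_volume = dot(w_log_volume, fused)
    log_density = dot(w_log_density, fused)
    return math.exp(log_volume + log_density)  # exp(log V + log rho) = V * rho
```

With all-zero gate weights every branch receives weight 1/3. A gate vector that responds to a large geometry feature shifts the mixture toward the geometry branch for that instance only, which is the sense in which the gating is instance-adaptive.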

Key facts

Mass estimation from RGB images is challenging because mass depends on both volume and density.
The framework uses monocular depth estimation for 3D geometry.
A vision-language model extracts material semantics.
An instance-adaptive gating mechanism fuses geometry, semantic, and appearance representations.
The method produces physically guided latent factors for volume and density.
The paper is available on arXiv with ID 2601.20303.
The approach constrains the space of plausible solutions using physical representations.
The work addresses the ill-posed nature of mass prediction from pixels.
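To make the role of depth more concrete, here is a hand-computed analogue of the volume factor. It is not the paper's method, which learns latent factors rather than computing volume explicitly. The sketch assumes a pinhole camera looking straight down at an object resting on a flat surface at a known depth, treats every masked pixel as a column rising from that surface, and uses a small table of approximate densities as a hypothetical stand-in for the material semantics a vision-language model would supply. The names `volume_from_depth`, `mass_from_depth`, and `APPROX_DENSITY` are illustrative.

```python
from typing import Dict, Sequence

# Approximate densities in kg/m^3: a hypothetical stand-in for the material
# semantics a vision-language model would provide.
APPROX_DENSITY: Dict[str, float] = {"water": 1000.0, "aluminium": 2700.0, "steel": 7850.0}


def volume_from_depth(
    depth: Sequence[Sequence[float]],
    mask: Sequence[Sequence[bool]],
    fx: float,
    fy: float,
    plane_depth: float,
) -> float:
    """Height-field volume estimate (m^3) for an object seen from directly above.

    depth: per-pixel metric depth in metres
    mask: same shape as depth, True where the pixel belongs to the object
    fx, fy: focal lengths in pixels (pinhole camera assumption)
    plane_depth: depth of the supporting surface in metres
    """
    volume = 0.0
    for depth_row, mask_row in zip(depth, mask):
        for z, inside in zip(depth_row, mask_row):
            if not inside:
                continue
            footprint = (z / fx) * (z / fy)  # metric area one pixel covers at depth z
            height = max(plane_depth - z, 0.0)  # column height above the surface
            volume += footprint * height
    return volume


def mass_from_depth(
    depth: Sequence[Sequence[float]],
    mask: Sequence[Sequence[bool]],
    fx: float,
    fy: float,
    plane_depth: float,
    material: str,
) -> float:
    """Mass in kg as explicit volume times a looked-up density (m = rho * V)."""
    return volume_from_depth(depth, mask, fx, fy, plane_depth) * APPROX_DENSITY[material]
```

For a 2x2 object patch at 0.9 m depth on a surface at 1.0 m, with focal lengths of 100 pixels, each pixel covers 0.009 m by 0.009 m and rises 0.1 m, giving about 3.24e-5 m³ in total, or roughly 0.032 kg if the material is water. Sizing each footprint at the top-surface depth ignores that a pixel's footprint grows with distance, so this slightly under-estimates every column's volume.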
