FocusDepth: Region-Aware Monocular Depth Estimation

other · 2026-05-13

Researchers have introduced Focusable Monocular Depth Estimation (FDE), a task that prioritizes depth accuracy in the foreground of designated target regions, sharp boundaries around them, and coherent global scene geometry. The accompanying FocusDepth framework uses box and text prompts to steer depth modeling. It incorporates a Multi-Scale Spatial-Aligned Fusion (MSSA) module that aligns features from Segment Anything Model 3 with the Depth Anything series. The method aims to address a limitation of monocular depth foundation models: their uniform pixel-wise objectives treat every pixel as equally important, regardless of which region the user cares about.
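The summary does not state FocusDepth's training objective. As an illustration only, the sketch below contrasts a uniform pixel-wise L1 error with a hypothetical region-weighted variant driven by a box prompt. The function names, the `(x0, y0, x1, y1)` box convention, and the `focus_weight` parameter are all assumptions made for this example, not details from the paper.

```python
def box_mask(height, width, box):
    """Binary mask for a box prompt (x0, y0, x1, y1), end-exclusive (assumed convention)."""
    x0, y0, x1, y1 = box
    return [[1.0 if (x0 <= x < x1 and y0 <= y < y1) else 0.0
             for x in range(width)]
            for y in range(height)]


def uniform_l1(pred, target):
    """Uniform pixel-wise objective: every pixel contributes equally."""
    errors = [abs(p - t)
              for pred_row, target_row in zip(pred, target)
              for p, t in zip(pred_row, target_row)]
    return sum(errors) / len(errors)


def region_weighted_l1(pred, target, mask, focus_weight=4.0):
    """Hypothetical region-aware objective (not the paper's loss):
    pixels inside the mask count focus_weight times as much as the rest."""
    weighted_error = 0.0
    total_weight = 0.0
    for pred_row, target_row, mask_row in zip(pred, target, mask):
        for p, t, m in zip(pred_row, target_row, mask_row):
            w = 1.0 + (focus_weight - 1.0) * m
            weighted_error += w * abs(p - t)
            total_weight += w
    return weighted_error / total_weight


# Toy 4x4 depth maps: the prediction is wrong only inside the prompted box.
target = [[0.0] * 4 for _ in range(4)]
mask = box_mask(4, 4, (1, 1, 3, 3))
pred = [[mask[y][x] for x in range(4)] for y in range(4)]

uniform = uniform_l1(pred, target)
focused = region_weighted_l1(pred, target, mask)
```

In this toy case the error sits entirely inside the box, so the weighted score is larger than the uniform one. That is the behavior a region-aware objective is meant to have: mistakes in the prompted region are penalized more heavily than the same mistakes elsewhere.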

Key facts

Focusable Monocular Depth Estimation (FDE) is a region-aware depth estimation task.
FocusDepth is a prompt-conditioned monocular relative depth estimation framework.
Multi-Scale Spatial-Aligned Fusion (MSSA) aligns multi-scale features from Segment Anything Model 3 to the Depth Anything family.
The method uses box/text prompts to specify target regions.
The approach prioritizes foreground depth accuracy, sharp boundary transitions, and coherent global scene geometry.
The paper is available on arXiv (arXiv:2605.11756).
The research addresses limitations of uniform pixel-wise objectives in monocular depth models.
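The summary does not describe MSSA's internals. The sketch below shows only the generic idea behind spatially aligning multi-scale features: maps at different resolutions are resampled to one shared grid so that each output pixel carries one value per scale. It uses single-channel maps, nearest-neighbor resampling, and plain stacking purely for illustration. `resize_nearest` and `align_and_stack` are hypothetical helpers, not part of FocusDepth, Segment Anything Model 3, or Depth Anything.

```python
def resize_nearest(fmap, out_h, out_w):
    """Nearest-neighbor resample of a 2D map (list of rows) to out_h x out_w."""
    in_h, in_w = len(fmap), len(fmap[0])
    return [[fmap[y * in_h // out_h][x * in_w // out_w]
             for x in range(out_w)]
            for y in range(out_h)]


def align_and_stack(feature_maps, out_h, out_w):
    """Bring every map onto one grid, then collect per-pixel values across scales.

    Returns an out_h x out_w grid where each cell is a list holding one value
    per input map. A real fusion module would typically use learned layers
    rather than plain stacking; this is an assumption-free placeholder.
    """
    resized = [resize_nearest(f, out_h, out_w) for f in feature_maps]
    return [[[r[y][x] for r in resized]
             for x in range(out_w)]
            for y in range(out_h)]


# A coarse 2x2 map and a fine 4x4 map, aligned to the 4x4 grid.
coarse = [[1, 2],
          [3, 4]]
fine = [[y * 4 + x for x in range(4)] for y in range(4)]

stacked = align_and_stack([coarse, fine], 4, 4)
```

After alignment, `stacked[y][x]` holds the coarse and fine values for the same spatial location, which is the precondition for fusing features from two different backbones pixel by pixel.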
