Template Collapse in 3D Medical Report Generation Identified and Mitigated

other · 2026-06-01

A recent study on arXiv describes a failure mode it terms 'Template Collapse' in 3D medical vision-language models (VLMs): the models produce fluent but overly generic radiology reports that under-report rare but critical findings. The study attributes this to limited data, severe label imbalance, and weak volumetric encoder signals, which together encourage shortcut learning. To address it, the researchers introduce CLarGen, a framework that decouples clinical detection from language generation and uses a Latent Query Transformer for multi-label pathology detection. The study diagnoses collapse using metrics for clinical fidelity, output diversity, normal-template bias, and rare-finding survival.

Key facts

Template Collapse is a failure mode in 3D medical VLMs causing generic reports.
Models under-report rare critical findings despite fluent text generation.
Constraints include limited data, severe label imbalance, and weak encoder signals.
CLarGen decouples detection from language synthesis.
CLarGen uses a Latent Query Transformer for multi-label pathology detection.
Diagnostic metrics: clinical fidelity, output diversity, normal-template bias, rare-finding survival.
Study published on arXiv with ID 2605.30984.
The research aims to improve pathology detection and output diversity.
