Token-Efficient Vision-Language Model for Pathology Report Generation
A new token-efficient vision-language model generates synoptic pathology reports from whole-slide images (WSIs). The model handles gigapixel resolution and multiple slides per case by combining a frozen pathology patch encoder, a two-layer MLP aligner, and an LLM decoder that uses WSI marker tokens to delimit slides. Training proceeds in two stages: WSI captioning on heterogeneous pairs, followed by case-level fine-tuning on report pairs. The approach shortens the visual-token sequences so that they fit within constrained GPU memory.
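As a rough illustration of how WSI marker tokens could delimit slides within one case, the following minimal sketch flattens a multi-slide case into a single decoder input sequence. The marker string `<wsi>`, the function name `build_case_sequence`, the placement of the marker before each slide, and the use of plain strings in place of aligned patch embeddings are all assumptions made for illustration; the source does not specify them.

```python
from typing import List, Sequence

# Hypothetical marker string; the source does not name the actual token.
WSI_MARKER = "<wsi>"


def build_case_sequence(
    slides: Sequence[Sequence[str]], prompt: Sequence[str]
) -> List[str]:
    """Flatten a multi-slide case into one decoder input sequence.

    Each slide's visual tokens are preceded by a marker so that the decoder
    can tell where one slide ends and the next begins. Placing the marker
    before each slide (rather than only between slides) is an assumption.
    """
    sequence: List[str] = []
    for slide_tokens in slides:
        sequence.append(WSI_MARKER)    # slide boundary
        sequence.extend(slide_tokens)  # stand-ins for aligned patch tokens
    sequence.extend(prompt)            # text prompt follows the visual tokens
    return sequence


# A toy case with two slides: two patch tokens, then one patch token.
case = [["s1_p0", "s1_p1"], ["s2_p0"]]
seq = build_case_sequence(case, ["Generate", "report"])
print(seq)
# ['<wsi>', 's1_p0', 's1_p1', '<wsi>', 's2_p0', 'Generate', 'report']
```

In this layout the sequence length grows with the total number of visual tokens across all slides plus one marker per slide, which is why reducing the number of visual tokens per slide matters when GPU memory is limited.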
Key facts
- Model generates case-level synoptic pathology reports from whole-slide images
- Architecture: frozen pathology patch encoder, two-layer MLP aligner, LLM decoder
- Explicit WSI marker token separates slides within a case
- Two-stage supervised training: WSI captioning then case-level fine-tuning
- Designed for constrained GPU memory
- Addresses gigapixel resolution and long visual-token sequences
- Handles heterogeneous tissues and ambiguous findings
- Published on arXiv as 2605.30716v1
Entities
Institutions
- arXiv