Token-Efficient Vision-Language Model for Pathology Report Generation

other · 2026-06-01

A token-efficient vision-language model generates case-level synoptic pathology reports from whole-slide images (WSIs). It handles gigapixel resolution and multiple slides per case with a frozen pathology patch encoder, a two-layer MLP aligner, and an LLM decoder that uses explicit WSI marker tokens to separate slides. Training is supervised and runs in two stages: WSI captioning on heterogeneous pairs, then case-level fine-tuning on report pairs. The design keeps visual-token sequences short enough to fit constrained GPU memory.
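The data flow described here (patch features, then a two-layer MLP aligner, then a decoder input in which a marker token opens each slide) can be illustrated with a small pure-Python sketch. It is not the authors' implementation. The marker string `<wsi>`, the GELU activation, the toy dimensions, the zero marker embedding, and all function and class names are assumptions made for illustration; the frozen encoder is stood in for by random feature vectors.

```python
import math
import random


def gelu(x):
    # Exact GELU; the activation used in the paper's aligner is not stated (assumption).
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def init_linear(in_dim, out_dim, rng):
    # Small random weights, zero bias: enough for a shape-level illustration.
    scale = 1.0 / math.sqrt(in_dim)
    weights = [[rng.uniform(-scale, scale) for _ in range(in_dim)] for _ in range(out_dim)]
    return weights, [0.0] * out_dim


def linear(vec, weights, bias):
    # One dense layer: each output is a dot product of a weight row with the input, plus bias.
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]


class MLPAligner:
    """Two-layer MLP mapping encoder features to the decoder's embedding width."""

    def __init__(self, enc_dim, hidden_dim, llm_dim, seed=0):
        rng = random.Random(seed)
        self.w1, self.b1 = init_linear(enc_dim, hidden_dim, rng)
        self.w2, self.b2 = init_linear(hidden_dim, llm_dim, rng)

    def __call__(self, patch_feature):
        hidden = [gelu(h) for h in linear(patch_feature, self.w1, self.b1)]
        return linear(hidden, self.w2, self.b2)


WSI_MARKER = "<wsi>"  # hypothetical marker string; the paper's actual token is not given


def build_case_sequence(case_slides, aligner, marker_embedding):
    # Each slide contributes one marker entry followed by its aligned patch embeddings.
    sequence = []
    for slide_features in case_slides:
        sequence.append((WSI_MARKER, marker_embedding))
        for feature in slide_features:
            sequence.append(("patch", aligner(feature)))
    return sequence


# Toy case: two slides with 3 and 2 patch features standing in for frozen-encoder output.
ENC_DIM, HIDDEN_DIM, LLM_DIM = 8, 16, 12
rng = random.Random(1)
case = [[[rng.random() for _ in range(ENC_DIM)] for _ in range(n)] for n in (3, 2)]

aligner = MLPAligner(ENC_DIM, HIDDEN_DIM, LLM_DIM)
marker_embedding = [0.0] * LLM_DIM  # placeholder; a real model would learn this vector
sequence = build_case_sequence(case, aligner, marker_embedding)
kinds = [kind for kind, _ in sequence]
print(kinds)
```

In this sketch only the aligner and the marker embedding would be trainable alongside the decoder; the encoder producing the input features stays fixed, which matches the frozen-encoder design in the summary.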

Key facts

Model generates case-level synoptic pathology reports from whole-slide images
Architecture: frozen pathology patch encoder, two-layer MLP aligner, LLM decoder
Explicit WSI marker token separates slides within a case
Two-stage supervised training: WSI captioning then case-level fine-tuning
Designed for constrained GPU memory
Addresses gigapixel resolution and long visual-token sequences
Handles heterogeneous tissues and ambiguous findings
Published on arXiv as 2605.30716v1
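To see why gigapixel resolution leads to long visual-token sequences, a rough token count helps. The sketch below uses assumed numbers (a 100,000 x 80,000 pixel slide, 256-pixel patches, a 4,096-token budget per slide) and a generic mean-pooling reduction. The summary does not say how the paper shortens its sequences, so the pooling step is a stand-in and not the authors' method. The count also covers every grid tile, whereas pipelines commonly discard background tiles first.

```python
import math


def patch_count(width_px, height_px, patch_px):
    # Number of non-overlapping tiles needed to cover the slide, one token per tile.
    return math.ceil(width_px / patch_px) * math.ceil(height_px / patch_px)


def pooling_factor(num_tokens, budget):
    # Smallest group size that brings the pooled sequence within the budget.
    return max(1, math.ceil(num_tokens / budget))


def mean_pool(tokens, factor):
    # Average consecutive groups of `factor` token vectors (generic reduction, assumed).
    pooled = []
    for start in range(0, len(tokens), factor):
        group = tokens[start:start + factor]
        dim = len(group[0])
        pooled.append([sum(vec[i] for vec in group) / len(group) for i in range(dim)])
    return pooled


# Assumed example values, not taken from the paper.
raw_tokens = patch_count(100_000, 80_000, 256)   # 391 * 313 tiles
factor = pooling_factor(raw_tokens, 4096)
pooled_tokens = math.ceil(raw_tokens / factor)
print(raw_tokens, factor, pooled_tokens)          # 122383 30 4080

# Tiny demonstration of the pooling itself on three 2-dimensional tokens.
demo = mean_pool([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], 2)
print(demo)                                       # [[2.0, 3.0], [5.0, 6.0]]
```

Under these assumptions a single slide yields over 120,000 patch tokens before any reduction, and a multi-slide case multiplies that, which is the memory pressure the key facts refer to.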
