ARTFEED — Contemporary Art Intelligence

Token-Efficient Vision-Language Model for Pathology Report Generation

other · 2026-06-01

A new token-efficient vision-language model generates synoptic pathology reports from whole-slide images (WSIs). It handles gigapixel resolution and multiple slides per case by combining a frozen pathology patch encoder, a two-layer MLP aligner, and an LLM decoder whose input uses WSI marker tokens to separate slides. Training proceeds in two supervised stages: first WSI captioning on heterogeneous slide-caption pairs, then case-level fine-tuning on case-report pairs. The design shortens the visual token sequence so that training fits within constrained GPU memory.
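As a rough illustration of how the pieces described above could fit together, the following minimal Python sketch projects patch features through a two-layer MLP and assembles one case-level sequence with a marker ahead of each slide. It is not the paper's implementation: the marker string `<wsi>`, the ReLU activation, the random weights, the marker placement at the start of each slide, and all function names are assumptions made for illustration, and the patch features stand in for the output of the frozen encoder.

```python
import random

WSI_MARKER = "<wsi>"  # hypothetical marker string; the real token is not given in the source


def make_mlp(in_dim, hidden_dim, out_dim, seed=0):
    """Create random weights for a two-layer MLP (stand-in for a trained aligner)."""
    rng = random.Random(seed)
    w1 = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(hidden_dim)]
    w2 = [[rng.uniform(-0.1, 0.1) for _ in range(hidden_dim)] for _ in range(out_dim)]
    return w1, w2


def linear(weights, x):
    """Matrix-vector product: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def align(mlp, patch_feature):
    """Map one patch feature into the decoder's embedding space."""
    w1, w2 = mlp
    hidden = [max(0.0, h) for h in linear(w1, patch_feature)]  # ReLU is an assumption
    return linear(w2, hidden)


def build_case_sequence(slides, mlp):
    """Concatenate all slides of a case, placing a marker before each slide.

    `slides` is a list of slides; each slide is a list of patch feature vectors
    assumed to come from the frozen patch encoder.
    """
    sequence = []
    for slide in slides:
        sequence.append(WSI_MARKER)
        sequence.extend(align(mlp, feature) for feature in slide)
    return sequence


# Toy case: two slides with 3 and 2 patches, 4-dimensional encoder features.
mlp = make_mlp(in_dim=4, hidden_dim=8, out_dim=6)
case = [[[0.1] * 4 for _ in range(3)], [[0.2] * 4 for _ in range(2)]]
sequence = build_case_sequence(case, mlp)
```

In this toy case the sequence holds two markers plus five projected patch vectors, so the decoder can tell where one slide ends and the next begins.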

Key facts

  • Model generates case-level synoptic pathology reports from whole-slide images
  • Architecture: frozen pathology patch encoder, two-layer MLP aligner, LLM decoder
  • Explicit WSI marker token separates slides within a case
  • Two-stage supervised training: WSI captioning then case-level fine-tuning
  • Designed for constrained GPU memory
  • Addresses gigapixel resolution and long visual-token sequences
  • Handles heterogeneous tissues and ambiguous findings
  • Published on arXiv as 2605.30716v1
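
The summary does not say how the visual-token sequence is shortened. One generic way to fit a fixed token budget, shown here purely as an assumed illustration and not as the paper's method, is to average consecutive patch embeddings in groups. The function names and the budget value below are hypothetical.

```python
def group_size_for_budget(num_tokens, budget):
    """Smallest group size whose pooled sequence has at most `budget` tokens."""
    if budget < 1:
        raise ValueError("budget must be at least 1")
    return max(1, -(-num_tokens // budget))  # ceiling division


def mean_pool_tokens(tokens, group_size):
    """Average each run of `group_size` consecutive token vectors into one vector."""
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    pooled = []
    for start in range(0, len(tokens), group_size):
        group = tokens[start:start + group_size]
        dim = len(group[0])
        pooled.append([sum(vec[i] for vec in group) / len(group) for i in range(dim)])
    return pooled


# Toy example: 10,000 patch tokens squeezed under a hypothetical budget of 512.
tokens = [[float(i), float(i) + 1.0] for i in range(10_000)]
size = group_size_for_budget(len(tokens), budget=512)
reduced = mean_pool_tokens(tokens, size)
```

Mean pooling trades spatial detail for sequence length; other reduction strategies (selection, clustering, learned resampling) would slot into the same place.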

Entities

Institutions

  • arXiv