Visual Text Compression as Measure Transport
A recent arXiv preprint proposes analyzing visual text compression (VTC) in the language of measure transport. VTC renders text into an image that a vision-language model then re-encodes, yielding 3–20× fewer decoder tokens than subword tokenization. Those token savings, however, do not predict downstream task performance. The authors treat text tokens and visual tokens as empirical probability measures and show that the ViT patch encoder induces a push-forward map between them. The resulting transport cost decomposes into a precision cost (within-patch aggregation) and a coverage cost (cross-patch interactions), which gives a principled measure of task-relevant information loss.
Key facts
- Visual text compression (VTC) renders text into an image for re-encoding by a vision-language model.
- VTC produces 3–20× fewer decoder tokens than subword tokenization.
- Token savings do not predict downstream task performance.
- The paper formulates VTC in the language of measure transport.
- Text and visual tokens are treated as empirical probability measures.
- The ViT patch encoder induces a push-forward map.
- Transport cost decomposes into precision cost (within-patch aggregation) and coverage cost (cross-patch interactions).
- The framework provides a principled measure of task-relevant information loss.
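To make the push-forward idea concrete, here is a minimal toy sketch. It is not the paper's method. It assumes text tokens are points in a small embedding space with uniform weights, that each token falls in exactly one patch, and that a patch is summarized by the weighted mean of its tokens. The names push_forward and within_patch_cost, the squared-distance cost, and the random embeddings are all hypothetical choices made for illustration. The within-patch cost is only a stand-in for the precision term; the coverage term (cross-patch interactions) is not modeled.

```python
import random


def push_forward(points, weights, assign, num_patches):
    """Push an empirical measure through a token-to-patch map.

    Each patch receives the total mass of its tokens; its location is the
    weighted mean of those tokens (an assumed stand-in for a patch encoder).
    """
    dim = len(points[0])
    masses = [0.0] * num_patches
    sums = [[0.0] * dim for _ in range(num_patches)]
    for x, w, j in zip(points, weights, assign):
        masses[j] += w
        for k in range(dim):
            sums[j][k] += w * x[k]
    means = [
        [s / m if m > 0 else 0.0 for s in row]
        for row, m in zip(sums, masses)
    ]
    return masses, means


def within_patch_cost(points, weights, assign, means):
    """Weighted squared distance from each token to its patch summary."""
    cost = 0.0
    for x, w, j in zip(points, weights, assign):
        cost += w * sum((a - b) ** 2 for a, b in zip(x, means[j]))
    return cost


# Hypothetical toy data: 12 tokens, 4 tokens per patch, 3-dim embeddings.
random.seed(0)
n_tokens, tokens_per_patch, dim = 12, 4, 3
points = [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_tokens)]
weights = [1.0 / n_tokens] * n_tokens
assign = [i // tokens_per_patch for i in range(n_tokens)]
num_patches = n_tokens // tokens_per_patch

masses, means = push_forward(points, weights, assign, num_patches)
cost = within_patch_cost(points, weights, assign, means)

# Baseline with no aggregation: one token per patch.
identity = list(range(n_tokens))
_, id_means = push_forward(points, weights, identity, n_tokens)
id_cost = within_patch_cost(points, weights, identity, id_means)

print("compression ratio:", n_tokens / num_patches)
print("patch masses:", masses)
print("within-patch cost:", cost)
print("cost with one token per patch:", id_cost)
```

In this toy setting the total mass is preserved by the push-forward, and the within-patch cost falls to zero when every token gets its own patch. The compression ratio alone says nothing about how large that cost is, which is one way to read the claim that token savings do not predict downstream performance.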
Entities
Institutions
- arXiv