LensVLM: Selective Context Expansion for Compressed Visual Text
Researchers have introduced LensVLM, an inference framework and post-training recipe that lets Vision Language Models (VLMs) work with compressed textual images: the model scans the compressed rendering and selectively expands only the relevant regions back to their uncompressed resolution. Built on Qwen3.5-9B-Base, LensVLM matches the accuracy of the full-text upper bound at 4.3x effective compression and outperforms both text- and visual-compression baselines at up to 10.1x effective compression. The approach mitigates the accuracy loss that occurs when text characters shrink below the effective resolution of the vision encoder. The work is described in a paper on arXiv (2605.07019).
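The mechanics can be pictured as a two-pass inference loop: a first pass over the low-resolution compressed page proposes answer-relevant regions, and a second pass answers with those regions re-expanded to readable resolution. The sketch below is illustrative only; `propose_regions`, `generate`, and the PIL-style `crop`/`resize` calls are assumptions, not LensVLM's actual API.

```python
# Hypothetical sketch of the scan-then-expand inference described above.
# The vlm object's methods and the image API are illustrative assumptions,
# not LensVLM's published interface.

from dataclasses import dataclass


@dataclass
class Region:
    """A bounding box (pixel coordinates) within the compressed page image."""
    x0: int
    y0: int
    x1: int
    y1: int


def answer_with_selective_expansion(vlm, page_image, question, scale=4):
    """Two-pass inference: scan the compressed page, then expand only the
    regions the model flags as relevant before answering."""
    # Pass 1: the model reads the low-resolution (compressed) rendering and
    # proposes regions that likely contain answer-relevant text.
    regions = vlm.propose_regions(image=page_image, query=question)

    # Re-render / upsample only those regions to readable resolution, so the
    # characters are no longer below the vision encoder's effective resolution.
    # (page_image is assumed to be a PIL.Image-style object.)
    crops = [
        page_image.crop((r.x0, r.y0, r.x1, r.y1)).resize(
            ((r.x1 - r.x0) * scale, (r.y1 - r.y0) * scale)
        )
        for r in regions
    ]

    # Pass 2: answer using the compressed page plus the expanded crops,
    # keeping total visual tokens well below a full-resolution rendering.
    return vlm.generate(images=[page_image, *crops], query=question)
```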
Key facts
- LensVLM is an inference framework and post-training recipe for VLMs.
- It enables VLMs to scan compressed images and selectively expand relevant regions.
- Built on Qwen3.5-9B-Base.
- Maintains accuracy comparable to the full-text upper bound at 4.3x effective compression.
- Outperforms text- and visual-compression baselines at up to 10.1x effective compression (one possible reading of "effective compression" is sketched after this list).
- Addresses accuracy loss from character shrinking in compressed images.
- Paper available on arXiv with ID 2605.07019.
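The summary does not define "effective compression," so the arithmetic below is only a plausible reading: the number of tokens the full text would require, divided by the visual tokens actually consumed, including any selectively re-expanded crops. The function name and the example token counts are assumptions for illustration.

```python
# Illustrative arithmetic only; the paper's exact definition of "effective
# compression" may differ. Here it is read as: full-text token count divided
# by the visual tokens actually consumed (compressed page + expanded crops).

def effective_compression(full_text_tokens, compressed_page_tokens, expanded_crop_tokens):
    return full_text_tokens / (compressed_page_tokens + expanded_crop_tokens)

# Example: a 4,300-token document rendered as ~700 visual tokens, plus ~300
# tokens of selectively expanded crops, gives roughly 4.3x effective compression.
print(effective_compression(4300, 700, 300))  # -> 4.3
```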
Entities
Institutions
- arXiv