Finetuning Qwen3-VL-Embedding-2B for Visual Document Retrieval Achieves 0.947 NDCG@10
A practical demonstration shows that finetuning the Qwen/Qwen3-VL-Embedding-2B model for Visual Document Retrieval (VDR) significantly improves performance: the resulting tomaarsen/Qwen3-VL-Embedding-2B-vdr model reaches an NDCG@10 of 0.947 on the evaluation data, up from the base model's 0.888, and outperforms all tested existing VDR models, including ones up to four times larger.

Training used CachedMultipleNegativesRankingLoss with a mini_batch_size of 1 to keep memory usage manageable, combined with MatryoshkaLoss so that embeddings can be truncated at inference time while retaining near-peak performance. The dataset was tomaarsen/llamaindex-vdr-en-train-preprocessed, a preprocessed English subset. Training arguments included bfloat16 precision and a per_device_train_batch_size of 64, and an InformationRetrievalEvaluator tracked retrieval metrics such as NDCG@10 and MAP during training.

The finetuned model's configuration defaults to producing 1024-dimensional embeddings, halving storage needs. The same Sentence Transformers infrastructure can also finetune multimodal Cross Encoder reranker models.
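For readers unfamiliar with the headline metric, NDCG@10 rewards rankings that place relevant documents near the top, discounting gains logarithmically by rank. A minimal sketch in plain Python (the `ndcg_at_k` helper and the toy relevance lists are illustrative, not taken from the evaluation above):

```python
import math

def ndcg_at_k(relevances, k=10):
    """Normalized DCG: DCG of the ranked list divided by DCG of the ideal ordering.

    `relevances` is the relevance of each retrieved document, in ranked order.
    """
    def dcg(rels):
        # Gains are discounted by log2(rank + 1), with ranks starting at 1.
        return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels[:k]))

    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# One query with a single relevant document (relevance 1): NDCG@10 is 1.0
# when it is ranked first, and 1/log2(3) ≈ 0.631 when it is ranked second.
print(ndcg_at_k([1, 0, 0, 0]))  # 1.0
print(ndcg_at_k([0, 1, 0, 0]))  # ≈ 0.631
```

The reported scores are this quantity averaged over all evaluation queries, so the jump from 0.888 to 0.947 reflects relevant pages moving consistently closer to rank 1.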
Key facts
- Finetuning improved NDCG@10 from 0.888 to 0.947.
- The finetuned 2B model outperformed existing VDR models up to 4x its size.
- Training used CachedMultipleNegativesRankingLoss with mini_batch_size=1.
- MatryoshkaLoss enabled effective embedding truncation for deployment.
- The dataset was tomaarsen/llamaindex-vdr-en-train-preprocessed.
- Training arguments included bf16=True and per_device_train_batch_size=64.
- The model defaults to 1024-dimensional embeddings to reduce storage.
- The same framework supports training multimodal reranker models.
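The Matryoshka truncation behind the 1024-dimension default can be sketched with plain NumPy. This is an illustrative sketch only: the 2048 full dimensionality is an assumption inferred from "1024 halves storage", and the random vectors merely stand in for real `model.encode` outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
full_dim, trunc_dim = 2048, 1024  # full size is assumed; 1024 is the documented default

# Simulated query and document embeddings; a real run would produce these
# with the finetuned model's encode method.
query = rng.standard_normal(full_dim)
docs = rng.standard_normal((5, full_dim))

def truncate_and_normalize(emb, dim):
    """Matryoshka-style truncation: keep the first `dim` components, then
    re-normalize so that dot products are again cosine similarities."""
    emb = np.asarray(emb)[..., :dim]
    return emb / np.linalg.norm(emb, axis=-1, keepdims=True)

# Full-size and half-size similarity scores for the same query/document pair.
scores_full = truncate_and_normalize(docs, full_dim) @ truncate_and_normalize(query, full_dim)
scores_half = truncate_and_normalize(docs, trunc_dim) @ truncate_and_normalize(query, trunc_dim)
print(scores_half.shape)  # (5,)
```

Because MatryoshkaLoss trains the leading components to carry most of the signal, the half-size scores rank documents nearly as well as the full-size ones in practice, at half the index storage and memory bandwidth; random vectors, as here, only demonstrate the mechanics.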
Entities
Institutions
- Hugging Face
- LlamaIndex