Evian: A New Benchmark for Auditing Visual Instruction Data
Researchers have introduced Evian, a framework for auditing the visual instruction-tuning data used to train Large Vision-Language Models (LVLMs). The work targets inconsistent training-data quality, a problem that existing filtering techniques handle poorly because their coarse-grained scores overlook nuanced semantic flaws such as logical inconsistencies and factual inaccuracies. To enable rigorous evaluation, the team built a 300,000-sample benchmark by systematically injecting subtle defects into the data. They also propose a 'Decomposition-then-Evaluation' paradigm that breaks each model response into three cognitive units (visual description, subjective inference, and factual claim) so that each can be assessed in a targeted way. The paper is available on arXiv under ID 2604.20544.
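The decomposition step can be illustrated with a toy sketch. The paper's actual method is not reproduced here; the cue lists and the rule-based tagger below are hypothetical stand-ins (a real system would likely use a learned classifier) that show how a response might be split into the three cognitive units so each can be audited separately.

```python
import re

# Illustrative cue words only -- purely an assumption for this sketch,
# not the paper's actual decomposition criteria.
INFERENCE_CUES = ("probably", "seems", "might", "suggests", "likely")
FACT_CUES = ("was built", "was founded", "is the capital", "invented")

def decompose(response: str) -> list[tuple[str, str]]:
    """Split a model response into sentences and tag each one as a
    visual description, subjective inference, or factual claim."""
    units = []
    for sentence in re.split(r"(?<=[.!?])\s+", response.strip()):
        if not sentence:
            continue
        low = sentence.lower()
        if any(cue in low for cue in INFERENCE_CUES):
            tag = "subjective inference"
        elif any(cue in low for cue in FACT_CUES):
            tag = "factual claim"
        else:
            tag = "visual description"  # default: grounded in the image
        units.append((tag, sentence))
    return units

units = decompose(
    "A tall iron tower stands by a river. "
    "It is probably the Eiffel Tower. "
    "The Eiffel Tower was built in 1889."
)
```

Tagging each unit separately is what makes the audit explainable: a flaw can be localized to, say, the factual claim rather than flagged as a single opaque score for the whole response.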
Key facts
- Evian is a framework for explainable visual instruction-tuning data auditing.
- The work addresses inconsistent quality in LVLM training data.
- Current filtering methods use coarse-grained scores that miss nuanced semantic flaws.
- A 300K-sample benchmark was constructed with systematically injected defects.
- A 'Decomposition-then-Evaluation' paradigm is introduced.
- Model responses are broken into visual description, subjective inference, and factual claim.
- The paper is available on arXiv with ID 2604.20544.
- The research targets improving reliability of LVLMs.
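The benchmark construction described above can be sketched in miniature. The defect taxonomy and injection pipeline below are hypothetical examples, not the paper's actual ones; they only illustrate the idea of systematically injecting a labeled, subtle flaw into an otherwise clean sample so auditors can be scored against ground truth.

```python
def inject_defect(sample: dict, kind: str) -> dict:
    """Return a copy of `sample` with one synthetic flaw of the given kind.
    Defect kinds here are illustrative assumptions, not the paper's taxonomy."""
    flawed = dict(sample)
    if kind == "attribute_swap":
        # Visual inaccuracy: e.g. report the wrong color.
        flawed["response"] = flawed["response"].replace("red", "blue")
    elif kind == "negation_flip":
        # Logical inconsistency: flip the polarity of the first claim.
        flawed["response"] = flawed["response"].replace(" is ", " is not ", 1)
    flawed["defect"] = kind  # ground-truth label for evaluating the auditor
    return flawed

clean = {"instruction": "Describe the car.", "response": "The car is red."}
flawed = inject_defect(clean, "attribute_swap")
```

Keeping the defect label alongside each corrupted sample is what turns the collection into a benchmark: an auditing method can be measured on whether it detects and explains exactly the injected flaw.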
Entities
Institutions
- arXiv