PLaMo 2.1-VL Vision Language Model Released for Japanese-Language Operation on Autonomous Devices
A lightweight Vision Language Model called PLaMo 2.1-VL has been introduced for autonomous devices, available in 8B- and 2B-parameter variants optimized for local and edge deployment and for Japanese-language operation. The model's core capabilities are Visual Question Answering and Visual Grounding, and development targeted two real-world applications: factory task analysis through tool recognition, and infrastructure anomaly detection. In evaluations, the model achieves 61.5 ROUGE-L on the JA-VG-VQA-500 benchmark and 85.2% accuracy on Japanese Ref-L4. On the practical tasks, it reaches 53.9% zero-shot accuracy on factory task analysis, and fine-tuning on power plant data raises anomaly detection from a 39.7 to a 64.9 F1-score. Development also included a large-scale synthetic data generation pipeline and comprehensive Japanese training and evaluation resources. Technical details were published in a report on arXiv, and the model outperforms comparable open models on both Japanese and English benchmarks.
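The ROUGE-L figure measures longest-common-subsequence overlap between a generated answer and a reference answer. The official JA-VG-VQA-500 evaluation harness and its tokenization rules are not described here, so the sketch below is only illustrative: it computes a plain character-level ROUGE-L F-measure for a single Japanese answer pair (benchmark scores of this kind are typically averaged over all items and reported on a 0-100 scale).

```python
# Illustrative character-level ROUGE-L; not the official JA-VG-VQA-500 scorer.

def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of two character sequences."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if ca == cb else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L as the harmonic mean of LCS-based precision and recall."""
    if not candidate or not reference:
        return 0.0
    lcs = lcs_length(candidate, reference)
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: model answer vs. reference answer for one VQA item.
print(rouge_l_f1("赤い自転車", "赤い自転車です"))  # ~0.83
```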
Key facts
- PLaMo 2.1-VL is a lightweight Vision Language Model for autonomous devices
- Available in 8B and 2B variants designed for local and edge deployment
- Operates with Japanese-language capabilities
- Core capabilities include Visual Question Answering and Visual Grounding
- Targets factory task analysis via tool recognition and infrastructure anomaly detection
- Achieves 61.5 ROUGE-L on JA-VG-VQA-500 benchmark
- Scores 85.2% accuracy on Japanese Ref-L4 (see the IoU sketch after this list)
- Fine-tuning on power plant data improves anomaly detection F1-score from 39.7 to 64.9
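Grounding benchmarks such as Ref-L4 are usually scored by checking whether the predicted bounding box overlaps the ground-truth box with an intersection-over-union (IoU) above a threshold, commonly 0.5. The exact Japanese Ref-L4 protocol is not given in this summary, so the sketch below only illustrates that common criterion.

```python
# Illustrative IoU-based grounding accuracy, assuming the common
# "correct if IoU >= 0.5" rule; not the exact Japanese Ref-L4 protocol.

def iou(box_a, box_b) -> float:
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def grounding_accuracy(predictions, references, threshold=0.5) -> float:
    """Fraction of predicted boxes overlapping the reference box above the threshold."""
    hits = sum(iou(p, r) >= threshold for p, r in zip(predictions, references))
    return hits / len(references)

# Example: two predictions, one of which matches its reference box.
preds = [(10, 10, 50, 50), (0, 0, 20, 20)]
refs = [(12, 8, 52, 48), (40, 40, 80, 80)]
print(grounding_accuracy(preds, refs))  # 0.5
```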
Entities
Institutions
- arXiv