PLaMo 2.1-VL Vision Language Model Released for Japanese-Language Operation on Autonomous Devices
A lightweight Vision Language Model called PLaMo 2.1-VL has been introduced for autonomous devices, available in 8B- and 2B-parameter variants optimized for local and edge deployment and for Japanese-language operation. The model's core capabilities are Visual Question Answering and Visual Grounding, and development targeted two real-world applications: factory task analysis through tool recognition, and infrastructure anomaly detection. In evaluations, the model achieves 61.5 ROUGE-L on the JA-VG-VQA-500 benchmark and 85.2% accuracy on Japanese Ref-L4. On the practical tasks, it reaches 53.9% zero-shot accuracy on factory task analysis, and fine-tuning on power plant data raises anomaly detection from a 39.7 to a 64.9 F1-score. Development also included a large-scale synthetic data generation pipeline and comprehensive Japanese training and evaluation resources. Technical details were published in a report on arXiv, and the model outperforms comparable open models on both Japanese and English benchmarks.
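The ROUGE-L figure measures longest-common-subsequence overlap between a generated answer and a reference answer. The official JA-VG-VQA-500 evaluation harness and its tokenization rules are not described here, so the sketch below is only illustrative: it computes a plain character-level ROUGE-L F-measure for a single Japanese answer pair (benchmark scores of this kind are typically averaged over all items and reported on a 0-100 scale).

```python
# Illustrative character-level ROUGE-L; not the official JA-VG-VQA-500 scorer.

def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of two character sequences."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if ca == cb else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L as the harmonic mean of LCS-based precision and recall."""
    if not candidate or not reference:
        return 0.0
    lcs = lcs_length(candidate, reference)
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: model answer vs. reference answer for one VQA item.
print(rouge_l_f1("赤い自転車", "赤い自転車です"))  # ~0.83
```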
Key facts
- PLaMo 2.1-VL is a lightweight Vision Language Model for autonomous devices
- Available in 8B and 2B variants designed for local and edge deployment
- Operates with Japanese-language capabilities
- Core capabilities include Visual Question Answering and Visual Grounding
- Targets factory task analysis via tool recognition and infrastructure anomaly detection
- Achieves 61.5 ROUGE-L on JA-VG-VQA-500 benchmark
- Scores 85.2% accuracy on Japanese Ref-L4 (see the IoU sketch after this list)
- Fine-tuning on power plant data improves anomaly detection F1-score from 39.7 to 64.9
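Grounding benchmarks such as Ref-L4 are usually scored by checking whether the predicted bounding box overlaps the ground-truth box with an intersection-over-union (IoU) above a threshold, commonly 0.5. The exact Japanese Ref-L4 protocol is not given in this summary, so the sketch below only illustrates that common criterion.

```python
# Illustrative IoU-based grounding accuracy, assuming the common
# "correct if IoU >= 0.5" rule; not the exact Japanese Ref-L4 protocol.

def iou(box_a, box_b) -> float:
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def grounding_accuracy(predictions, references, threshold=0.5) -> float:
    """Fraction of predicted boxes overlapping the reference box above the threshold."""
    hits = sum(iou(p, r) >= threshold for p, r in zip(predictions, references))
    return hits / len(references)

# Example: two predictions, one of which matches its reference box.
preds = [(10, 10, 50, 50), (0, 0, 20, 20)]
refs = [(12, 8, 52, 48), (40, 40, 80, 80)]
print(grounding_accuracy(preds, refs))  # 0.5
```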
Entities
Institutions
- arXiv