TorchSight: Open-Source Local LLM for Security Document Classification
TorchSight is an open-source system that classifies security documents locally, using a fine-tuned Qwen 3.5 27B model. It was trained on 78,358 samples drawn from 13 permissively licensed sources plus synthetic data generated with GPT-4, covering seven security categories and 51 subcategories. On a 1,000-document test set it reached 95.0% category-level accuracy (95% CI: 93.5-96.2), compared with 75.4-79.9% for commercial models evaluated under the same prompting protocol. On a separate external set of 500 held-out samples, accuracy was 93.8%, which suggests the result holds up beyond the primary test set. The system targets the problem of scanning documents for sensitive information without relying on cloud services or rule-based tools.
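As a sanity check on the reported interval, the sketch below computes a Wilson score interval for a binomial proportion. It assumes 950 correct predictions out of 1,000 (derived from the 95.0% figure) and assumes the Wilson method; the summary does not say which interval method the authors used, so this is an illustration rather than their procedure. The function name `wilson_interval` is hypothetical.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to a two-sided 95% confidence level.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Assumption: 95.0% accuracy on 1,000 documents means 950 correct.
lo, hi = wilson_interval(950, 1000)
print(f"95% CI: {lo * 100:.1f}-{hi * 100:.1f}")  # prints "95% CI: 93.5-96.2"
```

Under these assumptions the bounds match the reported 93.5-96.2 after rounding to one decimal place. Other interval methods give slightly different bounds.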
Key facts
- TorchSight is an open-source local system for security document classification.
- It uses a fine-tuned Qwen 3.5 27B model.
- Trained on 78,358 samples from 13 permissively licensed sources and GPT-4 synthetic data.
- Covers seven security categories and 51 subcategories.
- Achieved 95.0% category-level accuracy on 1,000 documents (95% CI: 93.5-96.2).
- Commercial models scored 75.4-79.9% under the same prompting protocol.
- On a separate external set of 500 held-out samples, accuracy was 93.8%.
- Designed to avoid sending data to external cloud infrastructure.