StructSense: AI Framework for Structured Info Extraction from Scientific Literature
Researchers have released StructSense, an open-source, modular, task-agnostic framework for extracting structured information from scientific literature. To address the limitations of large language models in specialized domains, it combines ontology-guided symbolic knowledge, agentic self-evaluative refinement, and human-in-the-loop validation. StructSense was evaluated on three tasks of varying semantic complexity: schema-based extraction of assessment instruments (91–100% accuracy), metadata and resource extraction from scientific papers (86–93% overall), and named entity recognition (NER) on neuroscience literature (58–75% label accuracy across 8,882 entities). On two biomedical NER benchmarks, NCBI Disease and S800 Species, it achieved ≥90% relaxed recall and 62.5% exact match. The work is available on arXiv under reference 2507.03674.
Key facts
- StructSense is a modular, task-agnostic, open-source framework.
- It integrates ontology-guided symbolic knowledge, agentic self-evaluative refinement, and human-in-the-loop validation.
- Achieved 91–100% accuracy on schema-based extraction of assessment instruments.
- Achieved 86–93% overall on metadata and resource extraction from scientific papers.
- Achieved 58–75% label accuracy on NER from neuroscience literature across 8,882 entities.
- On NCBI Disease and S800 Species benchmarks, achieved ≥90% relaxed recall and 62.5% exact match.
- Published on arXiv under reference 2507.03674.
- Addresses LLM limitations in specialized domains.
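The gap between the relaxed recall and exact match figures reported for the NER benchmarks is easier to read with a concrete scoring example. The sketch below is illustrative only: the summary does not say how StructSense defines "relaxed" matching, so the code assumes a common convention (same label plus any character overlap with the gold span). The `Span` type, the function names, and the toy spans are all hypothetical and are not taken from the framework or the paper.

```python
from typing import Callable, List, NamedTuple


class Span(NamedTuple):
    """A labeled entity mention (hypothetical type, for illustration only)."""
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    label: str


def exact_match(gold: Span, pred: Span) -> bool:
    # Exact match: identical boundaries and identical label.
    return gold == pred


def relaxed_match(gold: Span, pred: Span) -> bool:
    # ASSUMPTION: "relaxed" means same label and any character overlap.
    # The source does not state the definition the authors used.
    return (
        gold.label == pred.label
        and gold.start < pred.end
        and pred.start < gold.end
    )


def recall(
    gold_spans: List[Span],
    pred_spans: List[Span],
    match: Callable[[Span, Span], bool],
) -> float:
    # Fraction of gold spans matched by at least one predicted span.
    if not gold_spans:
        return 0.0
    found = sum(1 for g in gold_spans if any(match(g, p) for p in pred_spans))
    return found / len(gold_spans)


# Toy data: the second prediction starts three characters too early.
gold = [Span(0, 14, "Disease"), Span(30, 36, "Disease")]
pred = [Span(0, 14, "Disease"), Span(27, 36, "Disease")]

exact_recall = recall(gold, pred, exact_match)
relaxed_recall = recall(gold, pred, relaxed_match)
print(f"exact recall:   {exact_recall:.2f}")
print(f"relaxed recall: {relaxed_recall:.2f}")
```

Under this assumed definition, a prediction with slightly wrong boundaries still counts toward relaxed recall but not toward exact match, which is one way a system can score high on the first metric and noticeably lower on the second.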
Entities
Institutions
- arXiv