Naamah: New Sanskrit NER Dataset via DBpedia and LLMs
Researchers have introduced Naamah, a large-scale synthetic (silver-standard) Sanskrit Named Entity Recognition (NER) dataset comprising 102,942 sentences. The methodology combines entity extraction from DBpedia with a 24B-parameter hybrid reasoning model to generate grammatically natural and diverse training data. Two transformer architectures, XLM-RoBERTa and IndicBERTv2, are benchmarked on the dataset. The work addresses the scarcity of annotated resources for Sanskrit NLP, with a focus on digitizing classical Sanskrit literature.
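To make the silver-standard idea concrete, here is a minimal sketch of how a generated sentence could be labeled automatically once the entities placed in it are known. It is an illustration under assumptions, not the authors' pipeline: the BIO tagging scheme, the `PER`/`LOC` labels, the function name `bio_tag`, and the premise that the generator returns each entity in the inflected form used in the sentence are all hypothetical.

```python
from typing import Dict, List, Tuple


def bio_tag(tokens: List[str], entities: Dict[str, str]) -> List[Tuple[str, str]]:
    """Assign BIO tags by exact match against known entity surface forms.

    `entities` maps a whitespace-separated surface form to a label.
    The labels (e.g. "PER", "LOC") are hypothetical, not taken from Naamah.
    """
    tags = ["O"] * len(tokens)
    # Match longer surface forms first so multi-token entities are not
    # pre-empted by shorter entities that share their first token.
    spans = sorted(
        ((surface.split(), label) for surface, label in entities.items()),
        key=lambda pair: -len(pair[0]),
    )
    for words, label in spans:
        n = len(words)
        if n == 0:
            continue
        for i in range(len(tokens) - n + 1):
            window_is_free = all(t == "O" for t in tags[i:i + n])
            if window_is_free and tokens[i:i + n] == words:
                tags[i] = f"B-{label}"
                for j in range(i + 1, i + n):
                    tags[j] = f"I-{label}"
    return list(zip(tokens, tags))


# Illustrative sentence: "Rama goes to Ayodhya."
tokens = "रामः अयोध्यां गच्छति".split()
# Assumed: the generator reports the inflected forms it actually used.
entities = {"रामः": "PER", "अयोध्यां": "LOC"}
tagged = bio_tag(tokens, entities)
```

Exact string matching is the simplest choice here. Because Sanskrit is highly inflected and subject to sandhi, a real pipeline would likely need the generator to report the surface form it produced, or a morphological analyzer to recover it.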
Key facts
- Naamah is a silver-standard Sanskrit NER dataset with 102,942 sentences.
- Methodology uses DBpedia entity extraction and a 24B-parameter hybrid reasoning model.
- Benchmarked on XLM-RoBERTa and IndicBERTv2 transformer architectures.
- Aims to overcome scarcity of annotated resources for Sanskrit NLP.
- Focuses on classical Sanskrit literature digitization.
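The summary does not say which metric the benchmark reports. As a hedged illustration, NER models are commonly compared with entity-level F1, where a predicted entity counts only if its span and label both match the reference exactly. The sketch below assumes BIO tags; the helper names `extract_spans` and `entity_f1` are hypothetical.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, label)


def extract_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from one BIO-tagged sentence."""
    spans: Set[Span] = set()
    start, label = None, None
    # A trailing "O" sentinel closes any span that runs to the end.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


def entity_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Micro-averaged entity-level F1 over paired tag sequences."""
    tp = fp = fn = 0
    for gold_tags, pred_tags in zip(gold, pred):
        gold_spans = extract_spans(gold_tags)
        pred_spans = extract_spans(pred_tags)
        tp += len(gold_spans & pred_spans)
        fp += len(pred_spans - gold_spans)
        fn += len(gold_spans - pred_spans)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# One sentence: the model finds the person but misses the location.
gold = [["B-PER", "O", "B-LOC"]]
pred = [["B-PER", "O", "O"]]
score = entity_f1(gold, pred)  # precision 1.0, recall 0.5
```

Scoring whole spans rather than individual tokens penalizes partially recognized entities, which matters for multi-word names.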
Entities
Resources and models
- DBpedia
- XLM-RoBERTa
- IndicBERTv2