ARTFEED — Contemporary Art Intelligence

Naamah: New Sanskrit NER Dataset via DBpedia and LLMs

other · 2026-04-30

Researchers have introduced Naamah, a large-scale synthetic Sanskrit Named Entity Recognition (NER) dataset comprising 102,942 sentences. The methodology combines entity extraction from DBpedia with a 24B-parameter hybrid reasoning model to generate grammatically natural and diverse training data. The dataset is used to benchmark two transformer architectures: XLM-RoBERTa and IndicBERTv2. This work addresses the scarcity of annotated resources for digitizing classical Sanskrit literature.
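As a rough illustration of how a silver-standard NER example can be built once seed entities and a generated sentence are available, the sketch below projects known entity mentions onto a sentence as BIO tags. It is a minimal sketch and not the authors' pipeline: the function name `bio_tag`, the `PER`/`LOC` labels, the whitespace tokenization, and the example sentence are all assumptions made for illustration. Exact token matching is a simplification, since real Sanskrit text involves sandhi and inflection.

```python
from typing import List, Tuple


def bio_tag(tokens: List[str], entities: List[Tuple[List[str], str]]) -> List[str]:
    """Assign BIO tags by exact matching of entity token sequences (hypothetical helper)."""
    tags = ["O"] * len(tokens)
    for surface, label in entities:
        n = len(surface)
        if n == 0:
            continue
        for start in range(len(tokens) - n + 1):
            # Only tag a span that matches and has not been tagged already.
            window_free = all(t == "O" for t in tags[start:start + n])
            if tokens[start:start + n] == surface and window_free:
                tags[start] = f"B-{label}"
                for i in range(start + 1, start + n):
                    tags[i] = f"I-{label}"
    return tags


# Illustrative sentence ("Rama goes to Ayodhya"); not taken from the dataset.
sentence = "रामः अयोध्यां गच्छति"
tokens = sentence.split()

# Assumed seed entities, given in the inflected forms used in the sentence.
entities = [(["रामः"], "PER"), (["अयोध्यां"], "LOC")]

tags = bio_tag(tokens, entities)
for token, tag in zip(tokens, tags):
    print(token, tag)
```

Token and tag pairs in this form are the usual input for fine-tuning token-classification models such as XLM-RoBERTa or IndicBERTv2.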

Key facts

  • Naamah is a silver-standard Sanskrit NER dataset with 102,942 sentences.
  • Methodology uses DBpedia entity extraction and a 24B-parameter hybrid reasoning model.
  • Used to benchmark the XLM-RoBERTa and IndicBERTv2 transformer architectures.
  • Aims to overcome scarcity of annotated resources for Sanskrit NLP.
  • Focuses on classical Sanskrit literature digitization.

Entities

Institutions

  • DBpedia
  • XLM-RoBERTa
  • IndicBERTv2

Sources