Patent Retrieval Benchmark and Embedding Model: Sophia-bench and QaECTER

ai-technology · 2026-04-29

To address the lack of diverse benchmarks for patent search, a new benchmark, Sophia-bench, and a 344M-parameter embedding model, QaECTER, have been introduced. Sophia-bench comprises 10,000 queries and 75,000 corpus documents spanning ten years, eight IPC technology sections, and twelve filing jurisdictions. It evaluates retrieval effectiveness across 12 query types, including structured patent fields and AI-generated summaries, using citation-based ground truth supplemented by a domain-relevance metric called InScope. QaECTER is trained on patent citations with the aim of improving embedding quality. The stated goal of the work is to support innovation, patent examination, and IP strategy decisions.
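As a rough illustration of what citation-based ground truth means in practice, the sketch below scores a retrieval run by checking how many of each query's cited documents appear among its top-k results. The metric (recall@k), the function name `recall_at_k`, and the toy document IDs are all assumptions made for illustration; the summary does not say which metrics Sophia-bench reports, and InScope is not modeled here because its definition is not given.

```python
from typing import Dict, List, Set


def recall_at_k(
    ranked: Dict[str, List[str]],
    cited: Dict[str, Set[str]],
    k: int,
) -> float:
    """Mean fraction of each query's cited documents found in its top-k results.

    ranked: query id -> document ids ordered by retrieval score (best first)
    cited:  query id -> set of document ids cited by that query patent
    """
    scores = []
    for query_id, relevant in cited.items():
        if not relevant:
            continue  # queries without citation labels cannot be scored
        top_k = set(ranked.get(query_id, [])[:k])
        scores.append(len(top_k & relevant) / len(relevant))
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical toy run: two queries, made-up document ids.
ranked = {"q1": ["d3", "d1", "d7"], "q2": ["d2", "d9", "d4"]}
cited = {"q1": {"d1", "d5"}, "q2": {"d2"}}

print(recall_at_k(ranked, cited, k=2))  # 0.75
```

With k=2, query q1 recovers one of its two cited documents and q2 recovers its only one, so the mean is 0.75. A real evaluation would produce the rankings from embedding similarity rather than hand-written lists.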

Key facts

Sophia-bench contains 10,000 queries and 75,000 corpus documents.
Benchmark spans ten years, eight IPC technology sections, and twelve filing jurisdictions.
Evaluates retrieval across 12 query types, including structured patent fields and AI-generated summaries.
Uses citation-based ground truth, supplemented by the InScope domain-relevance metric.
QaECTER is a 344M-parameter embedding model.
Model trained on patent citations.
Addresses the lack of diverse benchmarks in patent retrieval.
Aims to support innovation, patent examination, and IP strategy decisions.

Sources

arXiv cs.AI — 2026-04-28