New AI Architecture Combines Lexical and Dense Retrieval for Dataset Search
A reference architecture for agentic hybrid retrieval in dataset search targets the problem of matching underspecified natural-language queries to sparse and diverse metadata records. The approach reframes dataset search as a software-architecture problem and presents a bounded, auditable system that fuses BM25 lexical search with dense-embedding retrieval via reciprocal rank fusion (RRF). A large language model (LLM) agent orchestrates the process by planning queries, assessing whether the results are adequate, and reranking candidates. To reduce the vocabulary mismatch between user intent and provider-authored metadata, the architecture adds an offline metadata augmentation phase in which an LLM generates pseudo-queries for each dataset record, enriching the retrieval indexes before any query is executed. Two architectural styles are explored: a single ReAct agent and a multi-agent horizontal architecture. The work is described in arXiv preprint 2604.16394v1, which was announced as a cross-listed submission.
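The fusion step can be illustrated with a minimal Python sketch of reciprocal rank fusion. It assumes the standard RRF rule, in which each document scores the sum of 1 / (k + rank) over the input rankings, with the commonly used constant k = 60. The summary does not state which constant or tie-breaking rule the authors use, and the function name, dataset IDs, and rankings below are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of IDs into one list using RRF.

    k = 60 is an assumed default, not a value taken from the preprint.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the top result contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists from the two retrievers.
bm25_ranking = ["ds_rainfall", "ds_census", "ds_traffic"]
dense_ranking = ["ds_census", "ds_air_quality", "ds_rainfall"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)
```

With these illustrative lists, ds_census (second lexically, first in the dense ranking) is placed first, ahead of ds_rainfall (first lexically, third in the dense ranking), because RRF rewards consistent placement across both retrievers rather than raw scores from either one.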
Key facts
- The architecture addresses ad hoc dataset search with underspecified natural-language queries.
- It combines BM25 lexical search with dense-embedding retrieval using reciprocal rank fusion (RRF).
- An LLM agent orchestrates query planning, result evaluation, and candidate reranking.
- Offline metadata augmentation involves LLM-generated pseudo-queries for dataset records.
- Two architectural styles are examined: single ReAct agent and multi-agent horizontal architecture.
- The work is documented in arXiv preprint 2604.16394v1.
- The arXiv announcement type is "cross", meaning the preprint was announced as a cross-listed submission.
- The goal is to reduce vocabulary mismatch between user intent and provider-authored metadata.
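The offline augmentation idea in the list above can be sketched roughly as follows. This is a hedged illustration, not the authors' implementation: the record fields (title, description), the `augment_record` helper, and the `stub_generator` that stands in for an LLM call are all hypothetical, and the summary does not say how pseudo-queries are prompted for, stored, or indexed.

```python
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]


def augment_record(
    record: Record,
    generate_pseudo_queries: Callable[[Record], List[str]],
) -> Record:
    """Return a copy of the record with pseudo-queries and combined index text.

    generate_pseudo_queries is a placeholder for an LLM call made offline,
    before any user query is served.
    """
    pseudo_queries = generate_pseudo_queries(record)
    # Concatenate provider metadata with user-style phrasings so that both
    # the lexical and the dense index can see the extra vocabulary.
    index_text = " ".join([record["title"], record["description"], *pseudo_queries])
    return {**record, "pseudo_queries": pseudo_queries, "index_text": index_text}


def stub_generator(record: Record) -> List[str]:
    # Deterministic stand-in for an LLM; a real system would prompt a model here.
    topic = record["title"].lower()
    return [f"data about {topic}", f"where can I find {topic} statistics"]


record = {
    "id": "ds_rainfall",
    "title": "Monthly Rainfall",
    "description": "Precipitation totals by station.",
}
augmented = augment_record(record, stub_generator)
print(augmented["index_text"])
```

Passing the generator in as a callable keeps the sketch runnable without any model dependency and makes the augmentation step easy to audit, since the generated pseudo-queries are stored alongside the original metadata rather than replacing it.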
Entities
Institutions
- arXiv