Structure-Preserving Retrieval for Semi-Structured Documents
A new retrieval pipeline described in an arXiv preprint (arXiv:2604.20849) addresses the mismatch between tree-structured documents such as HTML and embedding models that operate on flat sequences. The system, SPIRE, represents retrieval candidates as subdocuments: addressable selections of the document tree that keep their structural identity. It builds them from a small set of document primitives (paths, path sets, and subdocument extraction by pruning) and adds two contextualization mechanisms, global and local, intended to keep results interpretable. The stated goal is to improve the extraction of citation-ready evidence from semi-structured sources.
Key facts
- Paper ID: arXiv:2604.20849
- Announce Type: cross
- Focus on retrieval-augmented generation over semi-structured sources
- Proposes structure-aware retrieval pipeline
- Core concept: subdocuments as addressable selections
- Defines document primitives: paths, path sets, subdocument extraction by pruning
- Two contextualization mechanisms: global and local
- Addresses mismatch between document structure and flat sequence-based models
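The summary names global and local contextualization but does not say how either works. The sketch below shows one plausible reading, stated here as an assumption: global context is document-level information (a title), and local context is the chain of ancestors leading to the selected node. The function name `contextualize`, the bracketed output format, and the sample document are hypothetical and are not taken from the paper.

```python
import xml.etree.ElementTree as ET


def contextualize(root: ET.Element, path: tuple[int, ...], doc_title: str) -> str:
    """Build a flat string for an embedding model that still names its origin.

    Assumed reading: `doc_title` stands in for global context, and the
    breadcrumb of ancestor tags (with ids where present) stands in for
    local context.
    """
    node = root
    breadcrumb = [root.tag]
    for index in path:
        node = node[index]
        node_id = node.get("id")
        breadcrumb.append(f"{node.tag}#{node_id}" if node_id else node.tag)
    text = "".join(node.itertext()).strip()
    return f"[doc] {doc_title} [at] {' > '.join(breadcrumb)} [text] {text}"


root = ET.fromstring(
    '<html><body><section id="faq"><h2>FAQ</h2>'
    "<p>Refunds within 30 days.</p></section></body></html>"
)
print(contextualize(root, (0, 0, 1), "Acme Help"))
```

The point of the design is that a flat sequence model sees only the returned string, so any structural information it should use has to be written into that string explicitly.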
Entities
Institutions
- arXiv