Study Compares 14 Representations of Retrieved Content in RAG Pipelines
A new study posted to arXiv (2605.30790) systematically compares how different representations of retrieved documents affect large language model (LLM) performance in retrieval-augmented generation (RAG) pipelines. The researchers held retrieval fixed and varied only how the retrieved documents were represented, testing 14 transformations that include selection, summarisation, and reformulation, in both query-dependent and query-independent variants. They measured question-answering accuracy under each representation, addressing a gap in understanding which features of a document's representation matter most when the consumer is an LLM rather than a human. The work builds on prior research that examined single transformations in isolation, and provides a controlled comparison intended to identify the most impactful representation strategies.
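The controlled design described above can be illustrated with a small harness. This is only a sketch under stated assumptions, not the authors' code: the three transformations, the `toy_reader` stand-in for an LLM, and the two-example dataset are all hypothetical, chosen so the script runs without any model or external library. It shows the shape of the experiment (same retrieved documents, different representations, one accuracy score per representation), and its scores say nothing about the study's findings.

```python
from typing import Callable, Dict, List

# A transformation maps (document, query) -> the text shown to the reader.
# Query-independent variants simply ignore the query argument.
Transform = Callable[[str, str], str]


def identity(doc: str, query: str) -> str:
    """Baseline: pass the retrieved document through unchanged."""
    return doc


def lead_sentence(doc: str, query: str) -> str:
    """Query-independent selection (hypothetical): keep only the first sentence."""
    return doc.split(". ")[0]


def query_overlap_sentence(doc: str, query: str) -> str:
    """Query-dependent selection (hypothetical): keep the sentence that
    shares the most words with the query."""
    q_words = set(query.lower().split())
    sentences = doc.split(". ")
    return max(sentences, key=lambda s: len(q_words & set(s.lower().split())))


def toy_reader(context: str, query: str) -> str:
    """Hypothetical stand-in for an LLM call: returns the last word of the context."""
    return context.rstrip(".").split()[-1]


def evaluate(
    transforms: Dict[str, Transform],
    examples: List[dict],
    reader: Callable[[str, str], str],
) -> Dict[str, float]:
    """Exact-match accuracy per transformation, with retrieval held fixed."""
    scores = {}
    for name, transform in transforms.items():
        correct = 0
        for ex in examples:
            # The same retrieved documents are reused for every transformation;
            # only their representation changes.
            context = " ".join(transform(d, ex["query"]) for d in ex["retrieved"])
            if reader(context, ex["query"]).lower() == ex["answer"].lower():
                correct += 1
        scores[name] = correct / len(examples)
    return scores


TRANSFORMS: Dict[str, Transform] = {
    "identity": identity,
    "lead_sentence": lead_sentence,
    "query_overlap_sentence": query_overlap_sentence,
}

# Toy data: "retrieved" is fixed up front, mimicking a frozen retriever.
EXAMPLES = [
    {
        "query": "What is the capital of France",
        "retrieved": ["France is a country in Europe. The capital of France is Paris."],
        "answer": "Paris",
    },
    {
        "query": "Which planet is largest",
        "retrieved": ["The largest planet is Jupiter. It has many moons."],
        "answer": "Jupiter",
    },
]

if __name__ == "__main__":
    for name, acc in evaluate(TRANSFORMS, EXAMPLES, toy_reader).items():
        print(f"{name}: {acc:.2f}")
```

Passing the query to every transformation, even those that ignore it, keeps query-dependent and query-independent variants behind one interface, so the evaluation loop does not need to distinguish them. In a real setup, `toy_reader` would be replaced by a call to the LLM under test.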
Key facts
- Study compares 14 representations of retrieved documents in RAG pipelines
- Holds retrieval fixed and varies only the representation of retrieved documents
- Transformations include selection, summarisation, and reformulation
- Tests query-dependent and query-independent variants
- Measures question-answering accuracy
- Addresses gap in understanding LLM-specific content representation
- Builds on prior isolated studies of single transformations
- Posted to arXiv with ID 2605.30790
Entities
Institutions
- arXiv