Study Compares 14 Representations of Retrieved Content in RAG Pipelines

other · 2026-06-01

A new study posted to arXiv (2605.30790) systematically compares how different representations of retrieved documents affect large language model (LLM) performance in retrieval-augmented generation (RAG) pipelines. The researchers held retrieval fixed and varied only how the retrieved documents were represented, testing 14 transformations spanning selection, summarisation, and reformulation, in both query-dependent and query-independent variants. They measured question-answering accuracy under each representation, addressing a gap in understanding which features of a document's representation matter most when the consumer is an LLM rather than a human. The work builds on prior research that examined single transformations in isolation, and adds a controlled comparison intended to identify the most impactful representation strategies.
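The controlled design described above (same retrieved documents, different representations, one accuracy score per representation) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the three toy transformations, the `toy_reader` string-match stand-in for an LLM, and the two example questions are hypothetical and are not the study's 14 transformations, models, or data.

```python
import re
from typing import Callable, Dict, List

# Hypothetical signature: a representation maps (query, document) -> text.
# Query-independent variants simply ignore the query argument.
Representation = Callable[[str, str], str]


def _words(text: str) -> set:
    """Lowercased word set, punctuation stripped."""
    return set(re.findall(r"\w+", text.lower()))


def _sentences(doc: str) -> List[str]:
    """Naive sentence split on full stops (adequate for the toy documents)."""
    return [s.strip() for s in doc.split(".") if s.strip()]


def full_text(query: str, doc: str) -> str:
    """Baseline: pass the retrieved document through unchanged."""
    return doc


def lead_sentence(query: str, doc: str) -> str:
    """Query-independent selection: keep only the first sentence."""
    sents = _sentences(doc)
    return sents[0] + "." if sents else ""


def query_overlap_sentence(query: str, doc: str) -> str:
    """Query-dependent selection: keep the sentence sharing most words with the query."""
    sents = _sentences(doc)
    if not sents:
        return ""
    q_words = _words(query)
    best = max(sents, key=lambda s: len(q_words & _words(s)))
    return best + "."


def toy_reader(query: str, context: str, gold: str) -> bool:
    """Stand-in for the LLM (assumption): correct if the gold answer survives in the context."""
    return gold.lower() in context.lower()


def evaluate(representations: Dict[str, Representation],
             examples: List[dict]) -> Dict[str, float]:
    """Retrieval is fixed (each example carries its document); only the representation varies."""
    scores = {}
    for name, rep in representations.items():
        correct = sum(
            toy_reader(ex["query"], rep(ex["query"], ex["doc"]), ex["gold"])
            for ex in examples
        )
        scores[name] = correct / len(examples)
    return scores


REPRESENTATIONS: Dict[str, Representation] = {
    "full_text": full_text,
    "lead_sentence": lead_sentence,
    "query_overlap_sentence": query_overlap_sentence,
}

# Made-up examples; each "doc" plays the role of an already-retrieved document.
EXAMPLES = [
    {
        "query": "What is the capital of France?",
        "doc": "France is a country in Europe. The capital of France is Paris. "
               "It is known for wine.",
        "gold": "Paris",
    },
    {
        "query": "Who wrote Hamlet?",
        "doc": "Hamlet was written by William Shakespeare. "
               "It is a tragedy set in Denmark.",
        "gold": "Shakespeare",
    },
]

if __name__ == "__main__":
    for name, acc in evaluate(REPRESENTATIONS, EXAMPLES).items():
        print(f"{name}: {acc:.2f}")
```

In a real experiment the `toy_reader` function would be replaced by a call to the LLM under test plus an answer-matching metric. The point of the sketch is only the structure: because the document attached to each question never changes, any difference in accuracy between rows is attributable to the representation.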

Key facts

Study compares 14 representations of retrieved documents in RAG pipelines
Held retrieval fixed, varied only representation
Transformations include selection, summarisation, reformulation
Tested query-dependent and query-independent variants
Measured question-answering accuracy
Addresses gap in understanding LLM-specific content representation
Builds on prior isolated studies of single transformations
Published on arXiv with ID 2605.30790
