FORCEBENCH: A Stress Test for Citation Laundering in RAG

publication · 2026-05-28

A recent study presents FORCEBENCH, a contrastive stress test for detecting citation laundering in Retrieval-Augmented Generation (RAG) systems. Citation laundering occurs when a related source is presented as warrant for a claim stronger than the source supports, a failure that standard evaluation metrics often miss. Each benchmark item pairs a fixed cited passage with two claims: an evidence-calibrated claim and a localized force-raised variant. The benchmark covers five operational axes: relation, modality, scope, temporal validity, and numeric specificity. A properly calibrated evaluator should score the evidence-calibrated claim higher than its force-raised variant, which is the monotonicity condition the benchmark tests. On a 198-pair evaluation set, token and entity overlap violate monotonicity on 32.8–36.4% of pairs, and generic support prompting proves inadequate across the four model judges tested. The paper is available on arXiv under ID 2605.28044.
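The monotonicity check described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the `ContrastivePair` type, the `token_overlap` scorer, the two toy pairs, and the choice to count a violation only when the force-raised claim scores strictly higher (with ties reported separately) are all assumptions made for the example.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ContrastivePair:
    """Hypothetical container: one fixed passage, two claims, one axis."""
    passage: str
    calibrated: str    # claim whose force matches the evidence
    force_raised: str  # localized variant that overstates the evidence
    axis: str          # relation, modality, scope, temporal, or numeric


def tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def token_overlap(passage: str, claim: str) -> float:
    """Fraction of claim tokens that also appear in the passage."""
    claim_toks = tokens(claim)
    if not claim_toks:
        return 0.0
    return len(claim_toks & tokens(passage)) / len(claim_toks)


def citation_presence(passage: str, claim: str) -> float:
    """Scores 1.0 whenever a cited passage exists, regardless of the claim."""
    return 1.0 if passage.strip() else 0.0


def monotonicity_report(
    pairs: list[ContrastivePair],
    scorer: Callable[[str, str], float],
) -> dict[str, float]:
    """Rate of pairs where the force-raised claim outscores (or ties) the calibrated one."""
    violations = ties = 0
    for pair in pairs:
        good = scorer(pair.passage, pair.calibrated)
        bad = scorer(pair.passage, pair.force_raised)
        if bad > good:
            violations += 1
        elif bad == good:
            ties += 1
    n = len(pairs)
    return {"violation_rate": violations / n, "tie_rate": ties / n}


# Toy pairs invented for illustration; they are not taken from FORCEBENCH.
PAIRS = [
    ContrastivePair(
        passage="The drug reduced symptoms in some patients in a 2019 trial.",
        calibrated="The drug reduced symptoms in a subset of patients.",
        force_raised="The drug reduced symptoms in patients.",
        axis="scope",
    ),
    ContrastivePair(
        passage="Revenue grew by roughly 10 percent in 2021.",
        calibrated="Revenue grew by about 10 percent in 2021.",
        force_raised="Revenue grew by exactly 10 percent in 2021.",
        axis="numeric",
    ),
]

overlap_report = monotonicity_report(PAIRS, token_overlap)
presence_report = monotonicity_report(PAIRS, citation_presence)
print(overlap_report)
print(presence_report)
```

On these toy pairs, the overlap scorer prefers the overstated scope claim because dropping the hedge leaves only words that appear in the passage, and it cannot separate "about" from "exactly" at all. The citation-presence scorer ties on every pair, since both claims cite the same fixed passage, which is one way to read why such checks are uninformative by design.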

Key facts

FORCEBENCH is a contrastive stress test for evidence-force calibration in RAG.
Citation laundering presents a related source as warrant for an over-strong claim.
The benchmark uses five operational axes: relation, modality, scope, temporal validity, and numeric specificity.
A 198-pair evaluation set was used for headline experiments.
Token and entity overlap violate monotonicity on 32.8–36.4% of pairs.
Four model judges were tested with generic support prompting.
The paper is available on arXiv under ID 2605.28044.
Citation-presence sanity checks are uninformative by design.
