FORCEBENCH: A Stress Test for Citation Laundering in RAG
A recent study presents FORCEBENCH, a contrastive stress test for detecting citation laundering in Retrieval-Augmented Generation (RAG) systems. Citation laundering occurs when a related source is presented as warrant for an over-strong claim, a failure that standard evaluation metrics often miss. Each benchmark item pairs a fixed cited passage with an evidence-calibrated claim and a localized force-raised variant of that claim, and the pairs span five axes: relation, modality, scope, temporal validity, and numeric specificity. A properly calibrated evaluator should score the evidence-calibrated claim higher than the force-raised variant. On a 198-pair evaluation set, token-overlap and entity-overlap scoring violate this monotonicity on 32.8–36.4% of pairs, and generic support prompting proves inadequate across four model judges. The paper is available on arXiv under ID 2605.28044.
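The monotonicity check described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the `ContrastivePair` structure, the `overlap_score` function (fraction of claim tokens found in the passage), the choice to count ties as violations, and the two toy pairs are all assumptions made for illustration only.

```python
import re
from dataclasses import dataclass


@dataclass
class ContrastivePair:
    """Hypothetical container for one benchmark item (names are assumptions)."""
    passage: str       # fixed cited passage
    calibrated: str    # evidence-calibrated claim
    force_raised: str  # localized over-strong variant
    axis: str          # relation, modality, scope, temporal validity, or numeric specificity


def tokens(text):
    """Lowercased alphanumeric tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(claim, passage):
    """Fraction of claim tokens that also appear in the passage (assumed metric)."""
    claim_tokens = tokens(claim)
    if not claim_tokens:
        return 0.0
    return len(claim_tokens & tokens(passage)) / len(claim_tokens)


def violates_monotonicity(pair, score=overlap_score):
    """True when the calibrated claim is NOT scored strictly higher.

    Counting ties as violations is an assumption of this sketch.
    """
    return score(pair.calibrated, pair.passage) <= score(pair.force_raised, pair.passage)


def violation_rate(pairs, score=overlap_score):
    """Share of pairs on which the scorer breaks monotonicity."""
    return sum(violates_monotonicity(p, score) for p in pairs) / len(pairs)


# Invented toy pairs, not taken from the benchmark.
toy_pairs = [
    ContrastivePair(
        passage="The drug may reduce symptoms in some adult patients.",
        calibrated="The drug may reduce symptoms in some patients.",
        force_raised="The drug will reduce symptoms in all patients.",
        axis="modality",
    ),
    ContrastivePair(
        passage="Coffee intake was associated with lower risk in the cohort.",
        calibrated="Coffee drinking correlated with reduced risk among participants.",
        force_raised="Coffee intake causes lower risk in the cohort.",
        axis="relation",
    ),
]

rate = violation_rate(toy_pairs)
```

In the second toy pair the force-raised claim reuses the passage's wording while the calibrated claim paraphrases it, so a lexical-overlap scorer prefers the over-strong claim. This is the kind of failure the benchmark is built to expose.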
Key facts
- FORCEBENCH is a contrastive stress test for evidence-force calibration in RAG.
- Citation laundering presents a related source as warrant for an over-strong claim.
- The benchmark uses five operational axes: relation, modality, scope, temporal validity, and numeric specificity.
- A 198-pair evaluation set was used for headline experiments.
- Token and entity overlap violate monotonicity on 32.8–36.4% of pairs.
- Four model judges were tested with generic support prompting.
- The paper is available on arXiv under ID 2605.28044.
- Citation-presence sanity checks are uninformative by design.
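One plausible reading of the last point is that both claims in a pair cite the same fixed passage, so a check that only asks whether a citation is present cannot separate them. The Python sketch below illustrates that reading; `citation_presence_score` and the passage id `"doc-1"` are hypothetical and not taken from the paper.

```python
def citation_presence_score(cited_ids, passage_id):
    """Return 1.0 if the claim cites the passage, else 0.0 (hypothetical check)."""
    return 1.0 if passage_id in cited_ids else 0.0


# Both claims in a pair cite the same fixed passage (hypothetical id "doc-1").
calibrated_citations = {"doc-1"}
force_raised_citations = {"doc-1"}

calibrated_score = citation_presence_score(calibrated_citations, "doc-1")
force_raised_score = citation_presence_score(force_raised_citations, "doc-1")

# The scores tie, so the check cannot rank either claim above the other.
scores_tie = calibrated_score == force_raised_score
```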
Entities
Institutions
- arXiv