ARTFEED — Contemporary Art Intelligence

FORCEBENCH: A Stress Test for Citation Laundering in RAG

publication · 2026-05-28

A recent study presents FORCEBENCH, a contrastive stress test for detecting citation laundering in Retrieval-Augmented Generation (RAG) systems. Citation laundering occurs when a related source is presented as warrant for a claim stronger than the source supports, a flaw that standard evaluation metrics often overlook. Each benchmark item pairs a fixed cited passage with an evidence-calibrated claim and a localized force-raised variant, covering five axes: relation, modality, scope, temporal validity, and numeric specificity. A properly calibrated evaluator should score the evidence-calibrated claim higher than its force-raised variant. On a 198-pair evaluation set, token and entity overlap violate this monotonicity on 32.8–36.4% of pairs, and generic support prompting proves inadequate across four model judges. The paper is available on arXiv under ID 2605.28044.
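The monotonicity check described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the `ContrastivePair` structure, the `token_overlap` scorer, the choice to count ties as violations, and the two example pairs are all assumptions made here for illustration.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Set


@dataclass
class ContrastivePair:
    """Hypothetical container for one benchmark item (field names are assumptions)."""
    passage: str       # fixed cited passage
    calibrated: str    # evidence-calibrated claim
    force_raised: str  # localized force-raised variant
    axis: str          # relation, modality, scope, temporal validity, or numeric specificity


def tokens(text: str) -> Set[str]:
    """Lowercased alphanumeric tokens of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def token_overlap(passage: str, claim: str) -> float:
    """Fraction of distinct claim tokens that also appear in the passage."""
    claim_tokens = tokens(claim)
    if not claim_tokens:
        return 0.0
    return len(claim_tokens & tokens(passage)) / len(claim_tokens)


def violation_rate(pairs: List[ContrastivePair],
                   scorer: Callable[[str, str], float]) -> float:
    """Share of pairs where the calibrated claim does not score strictly higher.

    Assumption: a tie counts as a violation, since the evaluator is expected
    to assign the calibrated claim a higher score.
    """
    if not pairs:
        return 0.0
    violations = sum(
        1 for p in pairs
        if scorer(p.passage, p.calibrated) <= scorer(p.passage, p.force_raised)
    )
    return violations / len(pairs)


# Invented example pairs, not taken from the benchmark.
EXAMPLE_PAIRS = [
    ContrastivePair(
        passage="In a 2019 trial, the drug reduced symptoms in some adult patients.",
        calibrated="The drug reduced symptoms in some adult patients.",
        force_raised="The drug reduced symptoms in all adult patients.",
        axis="scope",
    ),
    ContrastivePair(
        passage="The results suggest the coating may slow corrosion.",
        calibrated="The findings indicate the coating might slow corrosion.",
        force_raised="The findings indicate the coating will slow corrosion.",
        axis="modality",
    ),
]

if __name__ == "__main__":
    print(violation_rate(EXAMPLE_PAIRS, token_overlap))
```

In the second pair the one-word edit swaps a hedge for a certainty marker, and neither word appears in the passage, so a lexical-overlap scorer cannot tell the two claims apart. This is the kind of failure a contrastive pair is meant to expose.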

Key facts

  • FORCEBENCH is a contrastive stress test for evidence-force calibration in RAG.
  • Citation laundering presents a related source as warrant for an over-strong claim.
  • The benchmark uses five operational axes: relation, modality, scope, temporal validity, and numeric specificity.
  • A 198-pair evaluation set was used for headline experiments.
  • Token and entity overlap violate monotonicity on 32.8–36.4% of pairs.
  • Four model judges were tested with generic support prompting.
  • The paper is published on arXiv with ID 2605.28044.
  • Citation-presence sanity checks are uninformative by design.

Entities

Institutions

  • arXiv
