Factual Consistency Metrics Fail for Long-Document Summarization
An analysis of six reference-free factuality metrics shows they are unreliable when applied to long-document summarization. This research, available on arXiv (2511.07689v2), evaluates metrics originally designed for short summaries against seven types of factuality-preserving perturbations: paraphrasing, simplification, synonym substitution, logically equivalent negations, vocabulary reduction, compression, and insertion of source text. Since these perturbations preserve factual content, an ideal metric should assign them unchanged scores. Testing on three long-form benchmark datasets (science fiction, legal, scientific) instead reveals inconsistent scores, underscoring the difficulties posed by input length constraints and long-range dependencies. The study also probes the metrics' sensitivity to retrieval context and claim information density, ultimately finding that metrics designed for short-form summaries yield unreliable results on longer texts.
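The evaluation protocol above can be sketched in miniature: score a summary against its source, apply a factuality-preserving perturbation, and measure how much the score drifts. The metric and perturbation below are toy stand-ins (simple token overlap and dictionary-based synonym replacement), not the six metrics or exact perturbations evaluated in the paper.

```python
def token_overlap(summary: str, source: str) -> float:
    """Toy reference-free 'factuality' score: the fraction of summary
    tokens that also appear in the source document."""
    src_vocab = set(source.lower().split())
    tokens = summary.lower().split()
    if not tokens:
        return 0.0
    return sum(t in src_vocab for t in tokens) / len(tokens)

# A factuality-preserving perturbation: synonym replacement.
# (Hypothetical mini-lexicon for illustration only.)
SYNONYMS = {"big": "large", "fast": "quick"}

def synonym_replace(text: str) -> str:
    return " ".join(SYNONYMS.get(t, t) for t in text.split())

source = "the big rocket made a fast ascent into orbit"
summary = "the big rocket made a fast ascent"

base = token_overlap(summary, source)
perturbed = token_overlap(synonym_replace(summary), source)

# An ideal metric would score both versions identically; the drift below
# is the kind of inconsistency the study measures across perturbations.
print(f"base={base:.2f} perturbed={perturbed:.2f} drift={abs(base - perturbed):.2f}")
```

Because the meaning is unchanged but the surface form is not, any drift in score is a robustness failure of the metric, not a factuality error in the summary.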
Key facts
- Six reference-free factuality metrics were evaluated.
- Seven factuality-preserving perturbations were applied.
- Three long-form benchmark datasets were used: science fiction, legal, scientific.
- The metrics were originally proposed for short-form summarization.
- Perturbations include paraphrasing, simplification, synonym replacement, logically equivalent negations, vocabulary reduction, compression, and source text insertion.
- Results show inconsistent scores under factuality-preserving perturbations of long documents.
- Study probes sensitivity to retrieval context and claim information density.
- Published on arXiv with ID 2511.07689v2.