Format-Constraint Coupling Reduces Knowledge Graph Fidelity on Statistical Tables
A recent arXiv paper (2605.21974) reports that coupling a serialization format with schema constraints can sharply change knowledge graph fidelity when extracting facts from statistical CSV tables. The study focuses on country-by-year time-series matrices sourced from open-data portals and finds a super-additive interaction: in a 2x2 factorial design across 6 datasets, the joint effect of format and schema exceeds the sum of their independent effects by up to +1.180. Bootstrap 95% confidence intervals are strictly positive on 4 of 6 datasets, with the strongest evidence on wide Type-II matrices. The coupling also cuts the other way: applying a schema to a mismatched format can fail catastrophically, pushing fact coverage below the unconstrained baseline on 4 of 6 datasets through entity inflation or extraction refusal. The authors call this phenomenon "format-constraint coupling." Probing and token ablation experiments support a surface-form anchoring explanation centered on column-name references. Controlled variants across format-schema pairings, GraphRAG hosts, and LLM families confirm that the effect is robust.
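To make "super-additive" concrete, the interaction in a 2x2 factorial design can be computed as the joint effect minus the sum of the two single-factor effects, each measured against the unconstrained baseline. The sketch below is illustrative only: the function name, the cell keys, and the scores are hypothetical and are not taken from the paper, whose exact metric and scale are not described here.

```python
def interaction_contrast(cells):
    """Interaction term of a 2x2 factorial design.

    `cells` maps (format_changed, schema_applied) -> mean score for that cell.
    A positive result means the joint effect exceeds the sum of the two
    single-factor effects (super-additive); zero means purely additive.
    """
    baseline = cells[(False, False)]
    format_effect = cells[(True, False)] - baseline   # format alone
    schema_effect = cells[(False, True)] - baseline   # schema alone
    joint_effect = cells[(True, True)] - baseline     # both together
    return joint_effect - (format_effect + schema_effect)


# Made-up scores for illustration; NOT results from the paper.
example_cells = {
    (False, False): 0.40,  # neither factor applied (baseline)
    (True, False): 0.50,   # format only
    (False, True): 0.45,   # schema only
    (True, True): 0.90,    # format and schema together
}

# Purely additive case: the joint cell equals baseline + both single effects.
additive_cells = dict(example_cells)
additive_cells[(True, True)] = 0.55

if __name__ == "__main__":
    print(round(interaction_contrast(example_cells), 3))
    print(round(interaction_contrast(additive_cells), 3))
```

With these made-up numbers the single-factor gains are small, so most of the joint gain shows up in the interaction term. A negative joint effect (coverage below baseline) would correspond to the mismatch failures described above.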
Key facts
- arXiv paper 2605.21974 studies format-constraint coupling in knowledge graph construction from statistical tables.
- Country-by-year time-series matrices from open-data portals are the focus.
- Format and schema constraints interact super-additively, with the joint effect exceeding the sum of the independent effects by up to +1.180.
- 2x2 factorial design used across 6 datasets.
- Bootstrap 95% CIs strictly positive on 4/6 datasets.
- Strongest evidence on wide Type-II matrices.
- Mismatched format and schema can cause catastrophic failure, reducing fact coverage below the unconstrained baseline on 4/6 datasets.
- Failures occur through entity inflation or extraction refusal.
- Surface-form anchoring explanation centered on column-name references is supported by probing and token ablation.
- Controlled variants across format-schema pairings, GraphRAG hosts, and LLM families confirm robustness.
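A "strictly positive" bootstrap 95% CI means that the whole interval for the interaction term lies above zero. The sketch below shows one common way to obtain such an interval, a percentile bootstrap that resamples scores within each cell of the 2x2 design. The resampling unit, the cell names, and the sample scores are assumptions made for illustration; the paper's actual procedure is not specified here.

```python
import random
from statistics import mean


def contrast(cells):
    """Interaction term: both - format_only - schema_only + neither."""
    return (mean(cells["both"]) - mean(cells["format_only"])
            - mean(cells["schema_only"]) + mean(cells["neither"]))


def bootstrap_interaction_ci(cells, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the 2x2 interaction term.

    `cells` maps a cell name to a list of per-item scores (hypothetical unit).
    Returns (point_estimate, lower_bound, upper_bound).
    """
    rng = random.Random(seed)
    draws = []
    for _ in range(n_boot):
        # Resample each cell with replacement, keeping its original size.
        resampled = {name: rng.choices(scores, k=len(scores))
                     for name, scores in cells.items()}
        draws.append(contrast(resampled))
    draws.sort()
    lo_idx = max(0, int((alpha / 2) * n_boot))
    hi_idx = min(n_boot - 1, int((1 - alpha / 2) * n_boot) - 1)
    return contrast(cells), draws[lo_idx], draws[hi_idx]


# Made-up per-item scores for illustration; NOT data from the paper.
sample_cells = {
    "neither":     [0.38, 0.41, 0.40, 0.42, 0.39],
    "format_only": [0.49, 0.52, 0.50, 0.51, 0.48],
    "schema_only": [0.44, 0.46, 0.45, 0.47, 0.43],
    "both":        [0.90, 0.85, 0.95, 0.90, 0.88],
}

if __name__ == "__main__":
    estimate, lower, upper = bootstrap_interaction_ci(sample_cells)
    print(f"interaction={estimate:.3f}, 95% CI=[{lower:.3f}, {upper:.3f}]")
    print("strictly positive" if lower > 0 else "interval includes zero")
```

The fixed seed keeps the sketch reproducible. In a real analysis the resampling unit (rows, queries, or whole tables) would need to match how the scores were collected.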
Entities
Institutions
- arXiv