Table Retrieval Instability Across Serializations Addressed by Centroid Averaging
A recent paper on arXiv (2604.24040) reports that transformer-based table retrieval systems show significant sensitivity to serialization format. When a structured table is flattened into a token sequence, semantically equivalent formats such as CSV, TSV, HTML, Markdown, and DDL yield markedly different embeddings and retrieval results across multiple benchmarks and retriever families. The authors propose treating each serialization's embedding as a noisy view of a shared semantic signal and using the centroid of those embeddings as a canonical target representation. Averaging the embeddings suppresses format-specific variation and can recover the semantic content common to the serializations, particularly when format-induced shifts differ from table to table. In aggregate pairwise comparisons, the centroid representation outperforms individual formats on MPNet and other retriever models.
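The centroid idea can be illustrated with a minimal sketch. This is not the paper's code: the function names `centroid` and `cosine` are hypothetical, the vectors are synthetic rather than real retriever embeddings, and the per-format noise is assumed to be independent and zero-mean, which is a simplification of how real serializations behave.

```python
import math
import random


def centroid(embeddings):
    """Element-wise mean of equal-length vectors, one per serialization."""
    if not embeddings:
        raise ValueError("need at least one embedding")
    dim = len(embeddings[0])
    if any(len(e) != dim for e in embeddings):
        raise ValueError("all embeddings must have the same dimension")
    n = len(embeddings)
    return [sum(e[i] for e in embeddings) / n for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    rng = random.Random(0)
    dim = 64
    # Assumption: one shared "semantic" vector stands in for the table's meaning.
    signal = [rng.gauss(0, 1) for _ in range(dim)]
    # Assumption: each format adds its own independent noise to that signal.
    formats = ["csv", "tsv", "html", "markdown", "ddl"]
    views = {f: [s + rng.gauss(0, 0.5) for s in signal] for f in formats}

    for name, vec in views.items():
        print(f"{name:9s} cosine to signal: {cosine(vec, signal):.3f}")
    target = centroid(list(views.values()))
    print(f"{'centroid':9s} cosine to signal: {cosine(target, signal):.3f}")
```

In a real pipeline the synthetic vectors would be replaced by the retriever's embeddings of each serialized table. Whether the centroid should be re-normalized before indexing depends on the retriever's similarity function and is left open here.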
Key facts
- arXiv paper 2604.24040 addresses table retrieval instability
- Transformer-based systems flatten tables into token sequences
- Semantically equivalent serializations (CSV, TSV, HTML, Markdown, DDL) produce different embeddings
- Instability observed across multiple benchmarks and retriever families
- Proposed method uses centroid averaging of serialization embeddings
- Centroid suppresses format-specific variation
- Centroid outperforms individual formats in aggregate pairwise comparisons
- Method tested on MPNet and other retrievers
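To make the serialization point concrete, the sketch below renders one small table in the five formats named above. The helper names and the sample table are hypothetical, and the DDL rendering (a `CREATE TABLE` with `TEXT` columns followed by `INSERT` statements) is an assumption, since the summary does not say how the paper serializes tables to DDL.

```python
import csv
import html
import io


def to_delimited(header, rows, delimiter):
    """CSV or TSV, depending on the delimiter."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter=delimiter, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


def to_markdown(header, rows):
    """Pipe table with a separator row under the header."""
    lines = [
        "| " + " | ".join(str(h) for h in header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(str(c) for c in row) + " |" for row in rows]
    return "\n".join(lines) + "\n"


def to_html(header, rows):
    """Minimal <table> markup with escaped cell text."""
    head = "".join(f"<th>{html.escape(str(h))}</th>" for h in header)
    body = "".join(
        "<tr>" + "".join(f"<td>{html.escape(str(c))}</td>" for c in row) + "</tr>"
        for row in rows
    )
    return f"<table><tr>{head}</tr>{body}</table>\n"


def to_ddl(table_name, header, rows):
    """Assumed DDL style: untyped TEXT columns plus one INSERT per row."""
    cols = ", ".join(f"{h} TEXT" for h in header)
    stmts = [f"CREATE TABLE {table_name} ({cols});"]
    for row in rows:
        values = ", ".join("'" + str(c).replace("'", "''") + "'" for c in row)
        stmts.append(f"INSERT INTO {table_name} VALUES ({values});")
    return "\n".join(stmts) + "\n"


def serialize_all(table_name, header, rows):
    """Return the same table as five different token-sequence inputs."""
    return {
        "csv": to_delimited(header, rows, ","),
        "tsv": to_delimited(header, rows, "\t"),
        "html": to_html(header, rows),
        "markdown": to_markdown(header, rows),
        "ddl": to_ddl(table_name, header, rows),
    }


if __name__ == "__main__":
    table = serialize_all("cities", ["name", "population"], [["Oslo", 700000]])
    for fmt, text in table.items():
        print(f"--- {fmt} ---\n{text}")
```

All five strings carry the same cells, yet they differ in delimiters, markup, and length, so a retriever that embeds the raw text receives five distinct inputs for a single table.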
Entities
Institutions
- arXiv