Table Retrieval Instability Across Serializations Addressed by Centroid Averaging

publication · 2026-04-29

A recent paper on arXiv (2604.24040) reports that transformer-based table retrieval systems are highly sensitive to the choice of serialization format. Because these systems flatten structured tables into token sequences, semantically equivalent serializations such as CSV, TSV, HTML, Markdown, and DDL can yield markedly different embeddings and retrieval results across multiple benchmarks and retriever families. The researchers propose treating a table's serialization embeddings as noisy views of a single shared semantic signal and using their centroid as the canonical target representation. Averaging the embeddings into a centroid suppresses format-specific variation and can recover the semantic content shared among serializations, particularly when format-induced shifts differ across tables. In aggregate pairwise comparisons, centroid representations outperform individual formats across MPNet and other retriever models.
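The "noisy views of a shared signal" idea can be illustrated with a small synthetic sketch. This is not the paper's code: the vectors below are random stand-ins for real retriever embeddings, and the dimension, noise scale, and the names `signal`, `views`, and `centroid` are assumptions chosen only for illustration. It shows why a plain mean over per-format embeddings tends to sit closer to the shared signal than a typical single format does when the format-specific shifts are independent.

```python
import numpy as np


def centroid(embeddings: np.ndarray) -> np.ndarray:
    """Plain mean over the rows (one row per serialization format)."""
    return embeddings.mean(axis=0)


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


rng = np.random.default_rng(0)
dim = 64  # assumed embedding size, for illustration only
formats = ["csv", "tsv", "html", "markdown", "ddl"]

# Hypothetical shared semantic signal for one table.
signal = rng.normal(size=dim)

# Synthetic per-format embeddings: signal plus an independent
# format-specific shift (a stand-in for real retriever outputs).
views = np.stack([signal + 0.8 * rng.normal(size=dim) for _ in formats])

# Similarity of each individual format to the shared signal.
per_format = [cosine(v, signal) for v in views]

# Similarity of the centroid to the shared signal.
central = cosine(centroid(views), signal)

print("mean per-format similarity:", float(np.mean(per_format)))
print("centroid similarity:", central)
```

In a real pipeline the rows of `views` would come from encoding each serialization of the same table with the retriever; whether embeddings are normalized before averaging is a design choice this summary does not specify.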

Key facts

arXiv paper 2604.24040 addresses table retrieval instability
Transformer-based systems flatten tables into token sequences
Semantically equivalent serializations (CSV, TSV, HTML, Markdown, DDL) produce different embeddings
Instability observed across multiple benchmarks and retriever families
Proposed method uses centroid averaging of serialization embeddings
Centroid suppresses format-specific variation
Centroid outperforms individual formats in aggregate pairwise comparisons
Method tested on MPNet and other retrievers
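To make the serialization point concrete, here is a minimal sketch that flattens one small table into several of the formats named above. The table contents and the helper names (`to_delimited`, `to_markdown`, `to_html`) are hypothetical, and the exact templates used in the paper are not given in this summary; DDL is omitted for that reason. The strings carry the same cells but differ token by token, which is the kind of variation an embedding model sees.

```python
import csv
import io

# Hypothetical toy table, used only for illustration.
header = ["name", "score"]
rows = [["alice", 90], ["bob", 85]]


def to_delimited(header, rows, delimiter):
    """Serialize with the stdlib csv writer (CSV or TSV)."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter=delimiter, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


def to_markdown(header, rows):
    """Serialize as a pipe-delimited Markdown table."""
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(str(c) for c in r) + " |" for r in rows]
    return "\n".join(lines) + "\n"


def to_html(header, rows):
    """Serialize as a minimal HTML table (no escaping; toy data only)."""
    head = "".join(f"<th>{h}</th>" for h in header)
    body = "".join(
        "<tr>" + "".join(f"<td>{c}</td>" for c in r) + "</tr>" for r in rows
    )
    return f"<table><tr>{head}</tr>{body}</table>\n"


serializations = {
    "csv": to_delimited(header, rows, ","),
    "tsv": to_delimited(header, rows, "\t"),
    "markdown": to_markdown(header, rows),
    "html": to_html(header, rows),
}

for name, text in serializations.items():
    print(f"--- {name} ---")
    print(text)
```

Each of these strings would be passed to the retriever's encoder separately, producing one embedding per format for the same table.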
