Clark Hash: Efficient Neural Embedding Compression via Sparse Projection

ai-technology · 2026-05-28

Clark Hash is a training-free codec that compresses neural embeddings by 32x. It normalizes each vector, applies a deterministic sparse signed Johnson-Lindenstrauss projection, clips the output, and stores scalar-quantized codes. In the default 384-dimensional sentence-embedding configuration, this cuts storage from 1536 bytes (dense f32) to 48 bytes. The method requires no training pass, learned codebooks, rotations, or corpus statistics. Queries stay in floating point and are scored directly against the stored sketches. In a multilingual sentence-similarity evaluation on 9,304 labeled pairs from 29 subsets, using a multilingual MiniLM encoder, the sketches reached macro Pearson correlations of 0.910 on STS17 and 0.946 on STS22 against dense cosine scores. The paper describes the codec and includes a Rust implementation.
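The four steps named above (normalize, sparse signed projection, clip, scalar-quantize) can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the paper's Rust implementation: the summary does not give the output dimension, bit width, sparsity, clip threshold, or hash function, so `OUT_DIM = 96`, `BITS = 4`, `NNZ = 4`, the 3-sigma `CLIP`, and the BLAKE2b-based hash are all hypothetical choices picked only so that the sketch fits in 48 bytes.

```python
import hashlib
import math
import struct

OUT_DIM = 96                       # assumed: 96 codes x 4 bits = 48 bytes
BITS = 4                           # assumed bit width per code
NNZ = 4                            # assumed nonzeros per input coordinate
CLIP = 3.0 / math.sqrt(OUT_DIM)    # assumed clip threshold (about 3 std devs)
LEVELS = (1 << BITS) - 1           # highest code value (15 for 4 bits)


def _slot_and_sign(seed, i, j):
    """Hash (seed, i, j) to an output slot and a +/-1 sign, deterministically."""
    digest = hashlib.blake2b(struct.pack("<QQQ", seed, i, j), digest_size=8).digest()
    h = int.from_bytes(digest, "little")
    return (h & 0xFFFFFFFF) % OUT_DIM, (1.0 if h >> 63 else -1.0)


def project(vec, seed=0):
    """Normalize to unit length, then apply the sparse signed projection."""
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    out = [0.0] * OUT_DIM
    scale = 1.0 / math.sqrt(NNZ)
    for i, v in enumerate(vec):
        for j in range(NNZ):
            slot, sign = _slot_and_sign(seed, i, j)
            out[slot] += sign * scale * v / norm
    return out


def encode(vec, seed=0):
    """Project, clip to [-CLIP, CLIP], quantize to 4-bit codes, pack two per byte."""
    codes = []
    for y in project(vec, seed):
        y = max(-CLIP, min(CLIP, y))
        codes.append(round((y + CLIP) / (2 * CLIP) * LEVELS))
    return bytes(codes[k] | (codes[k + 1] << 4) for k in range(0, OUT_DIM, 2))


def decode(sketch):
    """Unpack 4-bit codes and map them back to values in [-CLIP, CLIP]."""
    out = []
    for byte in sketch:
        for code in (byte & 0x0F, byte >> 4):
            out.append(code / LEVELS * 2 * CLIP - CLIP)
    return out


def score(query_vec, sketch, seed=0):
    """Asymmetric score: float-projected query against a dequantized sketch."""
    q = project(query_vec, seed)
    return sum(a * b for a, b in zip(q, decode(sketch)))


vec = [math.sin(i + 1) for i in range(384)]   # stand-in for a real embedding
sketch = encode(vec)
print(len(sketch))  # 48
```

Because the projection is derived from a hash of the seed and indices, no matrix is stored and the same input always yields the same sketch. The query side reuses `project` but skips clipping and quantization, which is one plausible reading of "queries stay in floating point".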

Key facts

Clark Hash compresses neural embeddings by 32x.
Default 384-dimensional embeddings are stored in 48 bytes instead of 1536 bytes (dense f32).
No training pass or learned codebooks required.
Uses deterministic sparse signed Johnson-Lindenstrauss projection.
Evaluated on 9,304 labeled pairs from 29 subsets.
Multilingual MiniLM encoder used in evaluation.
Achieved 0.910 macro Pearson on STS17.
Achieved 0.946 macro Pearson on STS22.
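The storage figures above follow from simple arithmetic, which the short check below reproduces. The 4-byte f32 size and the 48-byte sketch come from the summary; the list of bit widths is only an illustration of how a 384-bit budget could be split, since the summary does not say which width the codec uses.

```python
DIM = 384
dense_bytes = DIM * 4             # one 4-byte f32 per dimension
sketch_bytes = 48                 # stored sketch size reported in the summary
ratio = dense_bytes // sketch_bytes
bit_budget = sketch_bytes * 8     # total bits available per sketch

print(dense_bytes, ratio, bit_budget)  # 1536 32 384

# Hypothetical splits of the bit budget: codes available at each bit width.
codes_per_width = {bits: bit_budget // bits for bits in (1, 2, 4, 8)}
print(codes_per_width)  # {1: 384, 2: 192, 4: 96, 8: 48}
```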

Entities

—

Sources

arXiv cs.AI — 2026-05-28