SemanticZip: Lossy Text Compression via LLM Decompression

ai-technology · 2026-05-26

A new framework called SemanticZip proposes lossy text compression in which an LLM decompresses compact codes into task-relevant meaning instead of reconstructing the exact bytes. The pilot study formalizes LLM-mediated decompression with a protected/lossy packet architecture and evaluates six representation regimes (structured prose, JSON, CCL-Core, CCL-Min, SemanticZip ASCII, and SemanticZip emoji) over five author-constructed diagnostic cases. The approach treats model-based decompression as part of the codec and assesses recovery of semantic commitments rather than exact text. The paper makes no benchmark claims and serves as a proof of concept.
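
To make the protected/lossy split concrete, here is a minimal Python sketch. It is not the paper's packet format: the `Packet` class, its field names, and the `protected_intact` check are all hypothetical, and the sketch assumes the protected part holds spans that must survive decompression verbatim while the lossy part is a compact code a decoder LLM expands.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    """Hypothetical packet layout; names are illustrative, not from the paper."""

    # Assumption: spans that must be reproduced exactly (amounts, dates, IDs).
    protected: dict[str, str]
    # Assumption: compact code that a decoder LLM expands into meaning.
    lossy_code: str


def protected_intact(packet: Packet, decoded_text: str) -> bool:
    """Return True if every protected span appears verbatim in the decoded text."""
    return all(value in decoded_text for value in packet.protected.values())


packet = Packet(
    protected={"amount": "$4,200", "due": "2031-01-15"},
    lossy_code="PAY>ACME",
)
print(protected_intact(packet, "Pay ACME $4,200 by 2031-01-15."))
print(protected_intact(packet, "Pay ACME roughly four thousand dollars soon."))
```

Under this assumption, a paraphrase that drops or alters a protected span counts as a failed decode, while rewording of everything else is tolerated.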

Key facts

SemanticZip is a lossy text compression framework using LLMs as semantic decompressors.
It does not require byte-identical reconstruction, unlike lossless compression.
The framework defines a protected/lossy packet architecture.
Six representation regimes are evaluated: structured prose, JSON, CCL-Core, CCL-Min, SemanticZip ASCII, and SemanticZip emoji.
Five author-constructed diagnostic cases are used.
An independent decoder LLM reconstructs typed semantic atoms from compressed codes.
The paper is a pilot framework, not a benchmark claim.
Published on arXiv with ID 2605.24541.
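
Scoring recovery of typed semantic atoms, rather than exact text, could look like the Python sketch below. The `Atom` type, the normalization rule, and the recall metric are assumptions made for illustration; the summary above does not state how the paper represents or scores atoms.

```python
from typing import Iterable, NamedTuple


class Atom(NamedTuple):
    """Hypothetical typed semantic atom: a kind label plus a value."""

    kind: str   # e.g. "entity" or "deadline" (illustrative labels)
    value: str


def normalize(atom: Atom) -> Atom:
    """Lowercase and collapse whitespace so trivial surface differences match."""
    return Atom(atom.kind.strip().lower(), " ".join(atom.value.split()).lower())


def atom_recall(reference: Iterable[Atom], recovered: Iterable[Atom]) -> float:
    """Fraction of reference atoms that the decoder recovered."""
    ref = {normalize(a) for a in reference}
    if not ref:
        return 1.0  # nothing to recover
    got = {normalize(a) for a in recovered}
    return len(ref & got) / len(ref)


reference = [Atom("entity", "Alice"), Atom("deadline", "2031-01-15")]
recovered = [Atom("Entity", " alice ")]
print(atom_recall(reference, recovered))  # 0.5
```

Exact matching after normalization is the simplest choice; a real evaluation would likely need a looser notion of equivalence between atoms.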
