SemanticZip: Lossy Text Compression via LLM Decompression
SemanticZip is a proposed framework for lossy text compression in which an LLM decompresses compact codes into task-relevant meaning rather than reconstructing the exact bytes. The pilot study formalizes LLM-mediated decompression with a protected/lossy packet architecture and evaluates six representation regimes (structured prose, JSON, CCL-Core, CCL-Min, SemanticZip ASCII, and SemanticZip emoji) on five author-constructed diagnostic cases. The approach treats the decoding model as part of the codec and assesses recovery of semantic commitments rather than exact text. The paper makes no benchmark claims and presents itself as a proof of concept.
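The summary does not spell out the packet layout, so the following is only a minimal sketch of what a protected/lossy split could look like. It assumes that "protected" means fields that must survive verbatim and "lossy" means a compact code the decoder may expand freely. The names `Packet`, `toy_decoder`, and `decompress` are hypothetical, and the toy decoder is a hard-coded stand-in for an LLM, not the paper's method.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass(frozen=True)
class Packet:
    """Hypothetical packet; field names and layout are illustrative assumptions."""
    # Assumed meaning of "protected": values that must be reproduced verbatim.
    protected: Dict[str, str] = field(default_factory=dict)
    # Assumed meaning of "lossy": a compact code the decoder is free to expand.
    lossy: str = ""


def toy_decoder(code: str) -> str:
    """Stand-in for an LLM decoder: expands a tiny hard-coded abbreviation table."""
    table = {"mtg": "meeting", "mv": "moved to", "rm": "room"}
    return " ".join(table.get(token, token) for token in code.split())


def decompress(packet: Packet, decoder: Callable[[str], str]) -> str:
    """Expand the lossy part with the decoder, then append protected fields unchanged."""
    body = decoder(packet.lossy)
    protected = "; ".join(f"{key}={value}" for key, value in packet.protected.items())
    return f"{body} [{protected}]" if protected else body


packet = Packet(
    protected={"date": "2031-04-02", "amount": "1450.00"},
    lossy="mtg mv rm B",
)
text = decompress(packet, toy_decoder)
print(text)
```

In this sketch the decoder never sees the protected values, so it cannot paraphrase or corrupt them; only the lossy code is open to model interpretation.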
Key facts
- SemanticZip is a lossy text compression framework using LLMs as semantic decompressors.
- It does not require byte-identical reconstruction, unlike lossless compression.
- The framework defines a protected/lossy packet architecture.
- Six representation regimes are evaluated: structured prose, JSON, CCL-Core, CCL-Min, SemanticZip ASCII, and SemanticZip emoji.
- Five author-constructed diagnostic cases are used.
- An independent decoder LLM reconstructs typed semantic atoms from compressed codes.
- The paper presents a pilot framework and makes no benchmark claims.
- Published on arXiv with ID 2605.24541.
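To make the idea of recovering typed semantic atoms concrete, here is a hedged sketch of one way such recovery could be scored. The summary does not state the paper's metric, so the exact-match recall below, the `Atom` type, and the example atoms are all illustrative assumptions rather than the authors' procedure.

```python
from typing import NamedTuple, Set


class Atom(NamedTuple):
    """Hypothetical typed semantic atom: a type label plus a value."""
    kind: str
    value: str


def atom_recall(reference: Set[Atom], recovered: Set[Atom]) -> float:
    """Fraction of reference atoms the decoder recovered exactly (assumed metric)."""
    if not reference:
        return 1.0  # nothing to recover, so nothing was lost
    return len(reference & recovered) / len(reference)


# Atoms taken from a made-up source message.
reference = {
    Atom("entity", "Alice"),
    Atom("date", "2031-04-02"),
    Atom("action", "reschedule"),
    Atom("quantity", "3"),
}

# Atoms a decoder might reconstruct from the compressed code.
recovered = {
    Atom("entity", "Alice"),
    Atom("date", "2031-04-02"),
    Atom("action", "cancel"),  # wrong action: counts as a miss
}

score = atom_recall(reference, recovered)
print(score)  # 2 of 4 reference atoms recovered -> 0.5
```

Exact matching is the simplest possible choice; a real evaluation would likely need some tolerance for paraphrased values, which this sketch deliberately leaves out.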
Entities
Institutions
- arXiv