K-Token Merging: Latent-Space Compression for LLMs
K-Token Merging is a newly proposed framework for improving the efficiency of Large Language Models (LLMs) on long prompts: it compresses the sequence in latent space, lowering both memory and compute costs. Unlike methods that compress in token space, K-Token Merging operates on latent embeddings, merging each contiguous block of K token embeddings into a single embedding via a lightweight encoder. A LoRA-adapted LLM then processes the compressed sequence, while generation stays in the original vocabulary. Experiments on structural reasoning (Textualized Tree), sentiment analysis (Amazon Reviews), and code editing (CommitPackFT) show up to 4x compression with minimal performance loss, placing the method on the performance-compression Pareto frontier. The paper is available on arXiv.
Key facts
- K-Token Merging compresses token embeddings in latent space.
- It merges contiguous blocks of K embeddings into one via a lightweight encoder.
- The compressed sequence is processed by a LoRA-adapted LLM.
- Generation remains in the original vocabulary.
- Evaluated on Textualized Tree, Amazon Reviews, and CommitPackFT.
- Achieves up to 4x compression with minimal performance degradation.
- Lies on the Pareto frontier of performance vs. compression.
- Paper available on arXiv.
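The merging step in the facts above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the "lightweight encoder" is a single linear map over each concatenated block of K embeddings, assumes the sequence length is divisible by K (a real implementation would pad), and all names (`merge_tokens`, `W`, `b`) are invented for the example.

```python
import numpy as np

def merge_tokens(embeddings, W, b, k=4):
    """Merge each contiguous block of k token embeddings into one.

    embeddings: (T, d) token embeddings; T assumed divisible by k.
    W: (k*d, d) weight of an assumed linear encoder; b: (d,) bias.
    Returns a (T//k, d) compressed sequence in the same embedding width,
    which a LoRA-adapted LLM could then consume in place of the original.
    """
    T, d = embeddings.shape
    assert T % k == 0, "pad the sequence in practice; assumed divisible here"
    blocks = embeddings.reshape(T // k, k * d)  # concatenate each k-block
    return blocks @ W + b                        # lightweight linear encoder

# Toy example: 8 tokens, embedding dim 16, 4x compression.
rng = np.random.default_rng(0)
T, d, k = 8, 16, 4
x = rng.standard_normal((T, d))
W = rng.standard_normal((k * d, d)) * 0.1
b = np.zeros(d)
z = merge_tokens(x, W, b, k)
print(z.shape)  # (2, 16): sequence is 4x shorter, embedding width unchanged
```

The key property the sketch preserves is that compression happens before the LLM sees the sequence: the model's input length shrinks by a factor of K, while its output head and vocabulary are untouched.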