PIPO: Unifying Latent Compression and Multi-Token Prediction for Efficient LLM Decoding
Researchers have introduced Pair-In, Pair-Out (PIPO), a technique that combines latent compression with multi-token prediction (MTP) to reduce the inference cost of autoregressive decoding in large language models. The latent compressor and the MTP head act as complementary operations: the compressor folds two input tokens into a single latent representation, while the MTP head unfolds one hidden state into one additional output token. To avoid a costly verifier pass, PIPO uses a lightweight confidence head to decide whether draft tokens are accepted. The method thereby connects input-side and output-side techniques, which had been developed independently, into a single approach to efficient LLM inference.
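The pair-in, pair-out flow described above can be illustrated with a toy sketch. This is not the paper's implementation: the class `ToyPIPO`, the random weights, the single-layer "backbone", the `PAD` id, and the rule for handling a rejected draft are all assumptions made for illustration. Only the three roles (a compressor folding two tokens into one latent, an MTP head proposing one extra token, and a confidence head gating that draft without a verifier pass) come from the summary.

```python
import math
import random

D, VOCAB, PAD = 8, 16, 0  # toy sizes and a hypothetical padding id


def _matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def _matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def _argmax(v):
    return max(range(len(v)), key=v.__getitem__)


class ToyPIPO:
    """Hypothetical stand-in with random weights; nothing here is trained."""

    def __init__(self, seed=0):
        rng = random.Random(seed)
        self.embed = _matrix(VOCAB, D, rng)
        self.compress_w = _matrix(D, 2 * D, rng)  # compressor: 2 embeddings -> 1 latent
        self.backbone_w = _matrix(D, 2 * D, rng)  # stand-in for the LLM backbone
        self.lm_head = _matrix(VOCAB, D, rng)     # ordinary next-token head
        self.mtp_head = _matrix(VOCAB, D, rng)    # MTP head: one additional token
        self.conf_w = [rng.uniform(-1.0, 1.0) for _ in range(D)]  # confidence head

    def compress(self, tok_a, tok_b):
        # Pair in: concatenate two token embeddings and fold them into one latent.
        pair = self.embed[tok_a] + self.embed[tok_b]
        return [math.tanh(x) for x in _matvec(self.compress_w, pair)]

    def forward(self, latent, state):
        # One backbone position is spent on the latent instead of two on raw tokens.
        return [math.tanh(x) for x in _matvec(self.backbone_w, latent + state)]

    def step(self, pair, state, threshold):
        hidden = self.forward(self.compress(*pair), state)
        first = _argmax(_matvec(self.lm_head, hidden))
        draft = _argmax(_matvec(self.mtp_head, hidden))  # pair out: extra token
        score = sum(w * h for w, h in zip(self.conf_w, hidden))
        confidence = 1.0 / (1.0 + math.exp(-score))
        # The confidence head gates the draft; no second verifier pass is run.
        accepted = [first, draft] if confidence >= threshold else [first]
        return accepted, hidden


def generate(model, prompt_pair, steps, threshold=0.5):
    state, pair, out = [0.0] * D, prompt_pair, []
    for _ in range(steps):
        accepted, state = model.step(pair, state, threshold)
        out.extend(accepted)
        # Assumption: a rejected draft leaves a lone token, padded to form a pair.
        pair = (accepted[0], accepted[1] if len(accepted) == 2 else PAD)
    return out
```

In this sketch, `generate(ToyPIPO(), (1, 2), steps=5, threshold=0.0)` accepts every draft and so emits two tokens per step, while a threshold above 1 rejects every draft and falls back to one token per step. The threshold is the knob that trades speed against the risk of accepting a poor draft.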
Key facts
- PIPO unifies latent compression and multi-token prediction.
- Compressor folds two input tokens into one latent representation.
- MTP head unfolds one hidden state into one additional output token.
- Lightweight confidence head replaces expensive verifier pass.
- Method targets autoregressive decoding inference cost.
- Proposed in arXiv paper 2605.27255.
- Addresses the independent development of input-side and output-side methods.
- On-Po observation mentioned but not detailed.
Entities
Institutions
- arXiv