XFP: Dynamic Weight Quantizer for Efficient LLM Inference
XFP is a weight quantization method for large language models that removes the need for manual bit-width selection and calibration data. Instead, the operator specifies reconstruction-quality floors on per-channel cosine similarity: a strict floor for attention and shared-expert weights, and a more relaxed floor for routed mixture-of-experts layers. Given those floors, XFP chooses codebook size, outlier budget, and packing independently per layer, decomposing each weight matrix into a sparse fp16 outlier residual and a dense sub-byte index tensor. On Qwen3.5-122B-A10B, it reaches 138 tokens per second, 49% faster than Marlin INT4.
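The per-channel quality floor described above can be sketched in a few lines of plain Python. This is a minimal illustration of the acceptance check, not XFP's actual API; the function names and toy tensors are assumptions:

```python
import math

def channel_cos_sim(w, w_hat):
    """Cosine similarity between one original and one reconstructed channel."""
    dot = sum(a * b for a, b in zip(w, w_hat))
    norm = math.sqrt(sum(a * a for a in w)) * math.sqrt(sum(b * b for b in w_hat))
    return dot / norm if norm else 1.0

def meets_floor(channels, recon_channels, floor):
    """A quantization config is accepted only if every channel clears the floor."""
    return all(channel_cos_sim(w, r) >= floor
               for w, r in zip(channels, recon_channels))

# Toy example: a near-exact reconstruction clears a strict 0.999 floor,
# a coarse one does not.
w    = [[1.0, 2.0, 3.0],  [-0.5, 0.25, 4.0]]
good = [[1.0, 2.0, 3.01], [-0.5, 0.26, 4.0]]
bad  = [[1.0, 0.0, 3.0],  [0.0,  0.25, 4.0]]
print(meets_floor(w, good, 0.999))  # True
print(meets_floor(w, bad, 0.999))   # False
```

In this framing, the strict versus relaxed regimes differ only in the value of `floor` passed for attention/shared-expert layers versus routed-expert layers.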
Key facts
- XFP is a dynamic weight quantizer for LLM inference.
- Operator specifies reconstruction quality floors on per-channel cosine similarity.
- Strict floor for attention and shared experts; relaxed floor for routed-expert MoE.
- XFP automatically determines codebook size, outlier budget, and packing per layer.
- No Hessian, calibration data, or manual bit-width selection required.
- Weight matrix decomposed into sparse fp16 outlier residual and dense sub-byte index tensor.
- Two storage modes: V2 (per-channel Lloyd) and V2a (shared library of L=32 codebooks per layer).
- On Qwen3.5-122B-A10B, XFP achieves 138 tok/s on RTX PRO 6000 Blackwell at TP=2 with 94.49% GSM8K strict-match.
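The decomposition in the bullets above (a sparse full-precision outlier residual plus a dense index tensor over a per-channel Lloyd codebook, as in the V2 mode) can be sketched roughly as follows. All names, the 1D Lloyd fit, and the toy parameters are illustrative assumptions, not XFP's real implementation:

```python
def lloyd_codebook(values, k=4, iters=20):
    """Fit a 1D Lloyd (k-means) codebook to a channel's non-outlier weights."""
    lo, hi = min(values), max(values)
    centers = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for v in values:
            buckets[min(range(k), key=lambda i: abs(v - centers[i]))].append(v)
        centers = [sum(b) / len(b) if b else centers[i]
                   for i, b in enumerate(buckets)]
    return centers

def decompose(channel, n_outliers=1, k=4):
    """Split one channel into a sparse outlier residual + dense codebook indices."""
    order = sorted(range(len(channel)), key=lambda i: -abs(channel[i]))
    outlier_pos = set(order[:n_outliers])
    dense = [v for i, v in enumerate(channel) if i not in outlier_pos]
    centers = lloyd_codebook(dense, k)
    indices = [min(range(k), key=lambda j: abs(channel[i] - centers[j]))
               for i in range(len(channel))]
    # the residual stores a full-precision correction only at outlier slots
    residual = {i: channel[i] - centers[indices[i]] for i in outlier_pos}
    return centers, indices, residual

def reconstruct(centers, indices, residual):
    return [centers[j] + residual.get(i, 0.0) for i, j in enumerate(indices)]

# Toy channel with one large outlier (8.0); it is restored to full precision
# while the remaining weights round to the nearest codebook entry.
channel = [0.1, -0.2, 8.0, 0.05, -0.1, 0.15]
centers, indices, residual = decompose(channel)
rec = reconstruct(centers, indices, residual)
```

The V2a variant described above would differ in that the `centers` come from a shared per-layer library of L=32 codebooks rather than being fit per channel.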