Fluxion: Hybrid Sparse Attention for Long-Context Inference on CPU-GPU Systems
Fluxion is a framework for long-context inference that combines hybrid sparse attention with coordinated execution across CPU and GPU. It targets two bottlenecks: decoding-time KV states that exceed GPU memory capacity, and disaggregated prefill-decode systems that keep KV data in host memory. To that end, Fluxion applies output-aware KV budgeting and head-specific, granularity-aware sparse configuration, coordinated across devices. Its components include a lightweight head-property predictor, a granularity-budget selector, and a priority-based scheduler. The paper is available as arXiv preprint 2605.07719.
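The paper's exact budgeting rule is not reproduced in this summary. As a rough illustration only, output-aware KV budgeting could allocate each attention head a token budget proportional to an estimated sensitivity of the output to that head, then keep only the highest-scoring KV entries per head. The function names, the proportional-allocation rule, and the random scores below are hypothetical, not Fluxion's actual method.

```python
import numpy as np

def allocate_budgets(head_scores, total_budget):
    """Split a global KV-token budget across heads proportionally to
    each head's estimated output sensitivity (hypothetical rule)."""
    weights = head_scores / head_scores.sum()
    budgets = np.floor(weights * total_budget).astype(int)
    # Hand any leftover tokens to the highest-weight heads.
    for i in np.argsort(-weights)[: total_budget - budgets.sum()]:
        budgets[i] += 1
    return budgets

def select_kv(attn_scores, budget):
    """Keep the indices of the top-`budget` KV entries for one head."""
    return np.sort(np.argpartition(-attn_scores, budget - 1)[:budget])

rng = np.random.default_rng(0)
head_scores = rng.random(4) + 0.1             # per-head sensitivity estimates (synthetic)
budgets = allocate_budgets(head_scores, total_budget=64)
kept = [select_kv(rng.random(256), b) for b in budgets]  # retained KV indices per head
```

Heads judged more influential on the output retain more of their KV cache; the rest are pruned, shrinking what must live in (or stream from) GPU memory.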
Key facts
- Fluxion targets long-context inference with CPU-resident KV caches.
- It uses output-aware KV budgeting.
- It employs head-specific and granularity-aware sparse configuration.
- It enables cross-device coordinated execution.
- Components include a head-property predictor, granularity-budget selector, and priority-based scheduler.
- The paper is available on arXiv with ID 2605.07719.
- It mitigates PCIe bandwidth and GPU idle-time bottlenecks.
- The system is designed for disaggregated prefill-decode systems.
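The summary does not detail the priority-based scheduler. One plausible shape for a scheduler that overlaps host-to-GPU KV transfers with GPU compute is a priority queue keyed on estimated transfer cost, so the heads whose KV data takes longest to cross PCIe are issued first. Everything below (the bandwidth constant, the cost model, the function name) is an illustrative assumption, not Fluxion's documented design.

```python
import heapq

PCIE_GBPS = 16.0  # assumed effective PCIe bandwidth, for the cost estimate only

def fetch_order(head_kv_bytes):
    """Order CPU->GPU KV fetches longest-transfer-first, so slow copies
    start early and overlap with compute on already-resident heads.
    `head_kv_bytes` maps head id -> bytes of KV state in host memory."""
    # Negative cost makes heapq (a min-heap) pop the costliest fetch first.
    heap = [(-nbytes / (PCIE_GBPS * 1e9), head)
            for head, nbytes in head_kv_bytes.items()]
    heapq.heapify(heap)
    order = []
    while heap:
        _, head = heapq.heappop(heap)
        order.append(head)
    return order

fetch_order({0: 4_000_000, 1: 16_000_000, 2: 1_000_000})  # → [1, 0, 2]
```

Issuing the largest transfers first keeps the PCIe link busy while the GPU works on heads whose KV data is already device-resident, which is one way to attack the idle-time bottleneck the key facts mention.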
Entities
Institutions
- arXiv