Block Attention Generalized via SemanticSeg Dataset and Block Distillation
Researchers propose a method to generalize block attention to long-context scenarios such as retrieval-augmented generation (RAG). They introduce SemanticSeg, a dataset of over 30k instances across 16 categories, including books, code, web text, and conversations, with text lengths from 2k to 32k tokens. A lightweight segmenter is trained to partition text into human-aligned blocks. Block distillation is introduced as an efficient training framework that avoids performance degradation. The work addresses two obstacles: the difficulty of segmenting text and the inefficiency of fine-tuning.
Key facts
- SemanticSeg dataset contains over 30k instances across 16 categories
- Text lengths range from 2k to 32k tokens
- Categories include books, code, web text, and conversations
- A lightweight segmenter is trained for automatic text partitioning
- Block distillation is proposed as a more efficient training framework
- The method targets KV cache reuse in long-context RAG scenarios
- Block attention processes the input as separate blocks that do not attend to one another
- The approach aims to overcome segmentation and fine-tuning challenges
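To make the idea of non-attending blocks concrete, here is a minimal sketch of the attention mask such a scheme implies. It is an illustration, not the authors' implementation: the function name `block_attention_mask`, the example block lengths, and the choice of causal attention inside each block are all assumptions. The sketch treats every block alike; whether any block (for example a final query) may see the others is not stated here and is left out.

```python
from typing import List


def block_attention_mask(block_lengths: List[int]) -> List[List[bool]]:
    """Return mask[i][j], True when token i may attend to token j.

    Hypothetical helper: tokens attend only within their own block,
    and (by assumption) only to positions at or before their own.
    """
    # Map each token position to the index of the block it belongs to.
    block_of = [b for b, n in enumerate(block_lengths) for _ in range(n)]
    total = len(block_of)
    return [
        [block_of[i] == block_of[j] and j <= i for j in range(total)]
        for i in range(total)
    ]


# Two blocks of 2 and 3 tokens: the mask is block-diagonal.
mask = block_attention_mask([2, 3])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
# Prints:
# x....
# xx...
# ..x..
# ..xx.
# ..xxx
```

Because no token looks outside its own block in this sketch, the keys and values computed for one block do not depend on which other blocks are present, which is what makes KV cache reuse across prompts possible.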
Entities
Institutions
- arXiv