FAR Framework Optimizes Transformer Attention for ReRAM Accelerators
Researchers have introduced FAR (Function-preserving Attention Replacement), a framework that replaces the attention blocks in pretrained DeiT vision transformers with sequential modules suited to in-memory computing (IMC) devices. FAR substitutes self-attention with a multi-head bidirectional LSTM architecture via block-wise distillation, which yields computation that scales linearly with sequence length and localized weight reuse. The design targets the latency and bandwidth overhead that activation-to-activation multiplications and non-local memory access impose on ReRAM-based accelerators. Structured pruning then adapts the models to resource-constrained IMC arrays while preserving the original model's function. The authors evaluate FAR on the DeiT family of vision transformers.
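To make the linear-time claim concrete, the sketch below counts multiply-accumulate (MAC) operations for a single layer. It is a back-of-the-envelope illustration, not FAR's own cost model: the function names, the simplification of the multi-head design to one bidirectional LSTM, and the hidden size of 96 per direction are all assumptions introduced here. The token count of 197 and embedding width of 192 correspond to DeiT-Tiny at 224x224 input.

```python
# Rough MAC counts for one layer; illustrative assumptions only.


def attention_act_act_macs(num_tokens: int, dim: int) -> int:
    """MACs where BOTH operands are activations: Q @ K^T and softmax(.) @ V.

    Summed over all heads, each of the two products costs
    num_tokens * num_tokens * dim MACs, so this term grows quadratically
    with sequence length.
    """
    return 2 * num_tokens * num_tokens * dim


def bilstm_weight_act_macs(num_tokens: int, dim: int, hidden: int) -> int:
    """MACs for a bidirectional LSTM; every product is weight-times-activation.

    Per token and per direction, the 4 gates each apply an input-to-hidden
    (dim x hidden) and a hidden-to-hidden (hidden x hidden) weight matrix.
    """
    per_token_per_direction = 4 * (dim * hidden + hidden * hidden)
    return 2 * num_tokens * per_token_per_direction


if __name__ == "__main__":
    dim = 192    # DeiT-Tiny embedding width
    hidden = 96  # hypothetical hidden size per direction
    for num_tokens in (197, 394, 788):
        attn = attention_act_act_macs(num_tokens, dim)
        lstm = bilstm_weight_act_macs(num_tokens, dim, hidden)
        print(f"tokens={num_tokens}: attention act-act={attn}, BiLSTM weight-act={lstm}")
```

Doubling the token count quadruples the attention term but only doubles the recurrent term. At short sequence lengths the recurrent count can still be the larger of the two; what the sketch shows is how each term scales, and that the LSTM's products all use stationary weights, which is the property that matters for a crossbar that stores weights in place.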
Key facts
- FAR replaces attention in pretrained DeiTs with sequential modules for IMC compatibility
- Self-attention is replaced by multi-head bidirectional LSTM via block-wise distillation
- Enables linear-time computation and localized weight reuse
- Structured pruning allows adaptation to resource-constrained IMC arrays
- Evaluated on the DeiT family of vision transformers
- Addresses latency and bandwidth overhead on ReRAM accelerators
- Published on arXiv with ID 2505.21535
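The structured-pruning point can be illustrated with a minimal sketch. Structured pruning removes whole units (here, whole rows of a weight matrix) rather than individual weights, so the remaining matrix stays dense and can be sized to a fixed array. The L2-norm ranking criterion, the function name `prune_hidden_units`, and the `max_units` budget are assumptions chosen for illustration; the source does not state which criterion FAR uses.

```python
import math


def prune_hidden_units(weight_rows, max_units):
    """Keep the `max_units` rows with the largest L2 norm.

    Each row is treated as one hidden unit. Surviving rows keep their
    original relative order so the layer's layout stays predictable.
    Returns (kept_indices, pruned_rows).
    """
    norms = [math.sqrt(sum(w * w for w in row)) for row in weight_rows]
    # Rank unit indices from largest to smallest norm.
    ranked = sorted(range(len(weight_rows)), key=lambda i: norms[i], reverse=True)
    kept = sorted(ranked[:max_units])
    return kept, [weight_rows[i] for i in kept]


if __name__ == "__main__":
    # Toy 4-unit layer pruned to fit a hypothetical 2-row array budget.
    weights = [[0.1, 0.1], [2.0, 0.0], [0.0, 0.05], [1.0, 1.0]]
    kept, pruned = prune_hidden_units(weights, max_units=2)
    print("kept units:", kept)
    print("pruned weights:", pruned)
```

In a real model the matching columns of the next layer would have to be removed as well, and the pruned model would typically be fine-tuned or re-distilled to recover accuracy.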
Entities
Institutions
- arXiv