FAR Framework Optimizes Transformer Attention for ReRAM Accelerators

other · 2026-05-18

Researchers have introduced FAR (Function-preserving Attention Replacement), a framework that replaces the attention mechanisms in pretrained DeiT vision transformers with sequential modules suited to in-memory computing (IMC) devices. FAR substitutes self-attention with a multi-head bidirectional LSTM architecture trained through block-wise distillation, which gives linear-time computation and localized weight reuse. The design addresses the latency and bandwidth overhead that activation-to-activation multiplications and non-local memory access incur on ReRAM-based accelerators. Structured pruning then adapts the models to resource-constrained IMC arrays while preserving functional integrity. The framework is evaluated on the DeiT family of vision transformers.
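The contrast between quadratic attention and a linear-time recurrent replacement can be made concrete with a rough count of multiply-accumulate operations (MACs). The sketch below is an illustration only, not the paper's cost model: the function names are hypothetical, the LSTM is treated as a single head with a freely chosen hidden size, and biases, softmax, and element-wise gate operations are ignored. It separates MACs against stored weights, which can stay resident in an IMC array, from activation-to-activation MACs, where both operands are produced at run time.

```python
def attention_macs(n_tokens, dim):
    """Approximate MACs for one self-attention layer, split by operand type.

    Returns (weight_macs, activation_macs). Summed over heads, the totals
    do not depend on the number of heads.
    """
    weight_macs = 4 * n_tokens * dim * dim           # Q, K, V and output projections
    activation_macs = 2 * n_tokens * n_tokens * dim  # Q @ K^T and scores @ V
    return weight_macs, activation_macs


def bilstm_macs(n_tokens, dim, hidden):
    """Approximate MACs for one bidirectional LSTM layer (assumed single head).

    Every matrix product uses a stored weight, so the activation-to-activation
    count is zero under this simplified model.
    """
    per_step = 4 * (dim * hidden + hidden * hidden)  # four gates per time step
    weight_macs = 2 * n_tokens * per_step            # forward and backward passes
    return weight_macs, 0


# Assumed example sizes: 197 tokens and width 192 roughly match DeiT-Tiny
# at 224x224 input; the hidden size of 96 is an arbitrary illustrative choice.
for n in (197, 394, 788):
    _, attn_act = attention_macs(n, 192)
    lstm_weight, lstm_act = bilstm_macs(n, 192, 96)
    print(n, attn_act, lstm_weight, lstm_act)
```

Doubling the token count quadruples the activation-to-activation term for attention, while the recurrent layer's cost only doubles and stays entirely in weight-stationary products.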

Key facts

FAR replaces attention in pretrained DeiTs with sequential modules for IMC compatibility
Self-attention is replaced by multi-head bidirectional LSTM via block-wise distillation
Enables linear-time computation and localized weight reuse
Structured pruning allows adaptation to resource-constrained IMC arrays
Evaluated on the DeiT family of vision transformers
Addresses latency and bandwidth overhead on ReRAM accelerators
Published on arXiv with ID 2505.21535
Announce type: replace-cross
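
Structured pruning, as mentioned in the key facts, removes whole units rather than individual weights, so the remaining matrices stay dense and can be mapped onto a smaller array. The summary does not say which pruning criterion FAR uses; the minimal sketch below assumes a simple L2-norm ranking of hidden units, and the function name and toy matrices are hypothetical.

```python
import math


def prune_hidden_units(w_in, w_out, keep):
    """Keep the `keep` hidden units with the largest input-weight L2 norm.

    w_in  : list of rows, one row per hidden unit (hidden x input)
    w_out : list of rows, one column per hidden unit (output x hidden)
    Returns (pruned_w_in, pruned_w_out, kept_indices).
    The norm-based criterion is an assumption made for illustration.
    """
    norms = [math.sqrt(sum(x * x for x in row)) for row in w_in]
    ranked = sorted(range(len(w_in)), key=lambda i: norms[i], reverse=True)
    kept = sorted(ranked[:keep])                       # preserve original order
    pruned_in = [w_in[i] for i in kept]                # drop whole rows
    pruned_out = [[row[i] for i in kept] for row in w_out]  # drop matching columns
    return pruned_in, pruned_out, kept


# Toy example: three hidden units, the middle one has the smallest norm.
w_in = [[1.0, 0.0], [0.1, 0.1], [0.0, 2.0]]
w_out = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
small_in, small_out, kept = prune_hidden_units(w_in, w_out, keep=2)
print(kept)       # [0, 2]
print(small_out)  # [[1.0, 3.0], [4.0, 6.0]]
```

Because a pruned unit's row and its matching column are removed together, the layer shrinks from three hidden units to two without leaving zero-filled cells behind.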
