DASH: Fast Differentiable Search for Hybrid Attention in LLMs

ai-technology · 2026-05-22

DASH is a fast differentiable search framework for designing hybrid attention architectures in large language models (LLMs). Hybrid architectures combine different attention mechanisms across layers to improve inference efficiency while preserving model quality. Existing designs rely on manual rules or proxy-based selectors, and recent NAS-style methods are expensive: Jet-Nemotron's PostNAS search uses 200 billion tokens, which puts it out of reach for routine use. DASH relaxes the discrete placement of attention operators into continuous architecture logits, prepares reusable teacher-aligned linear candidates, and runs an architecture-only search in which model and operator weights stay frozen, which sharply reduces compute cost. As a result, a hybrid architecture search takes minutes on a single GPU, which makes the approach practical to adopt more widely and automates the choice of which attention operator goes in which layer.
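The summary does not give DASH's exact formulation. As a rough illustration, the softmax relaxation commonly used in differentiable architecture search can be written as follows, where the symbols are assumptions introduced here: \ell indexes layers, o_k are the candidate attention operators, \alpha_{\ell,k} are the architecture logits, and \theta stands for the frozen model and operator weights.

```latex
% Assumed notation, not taken from the DASH paper.
% Relaxed (mixed) operator at layer \ell:
\tilde{o}_\ell(x) \;=\; \sum_{k} \frac{\exp(\alpha_{\ell,k})}{\sum_{j} \exp(\alpha_{\ell,j})}\; o_k(x)

% Architecture-only search: only \alpha is optimized, \theta stays fixed.
\alpha^{*} \;=\; \arg\min_{\alpha}\; \mathcal{L}\bigl(\alpha;\ \theta_{\text{frozen}}\bigr)

% Discretization after search: keep the highest-scoring operator per layer.
k^{*}_{\ell} \;=\; \arg\max_{k}\; \alpha^{*}_{\ell,k}
```

Because the gradient flows only into the logits, each search step is far cheaper than one that also updates model weights, which is consistent with the minutes-on-one-GPU claim.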

Key facts

DASH is a fast differentiable search framework for hybrid attention architecture design.
Hybrid attention architectures improve LLM inference efficiency while preserving model quality.
Existing designs rely on manual rules or proxy-based selectors.
Jet-Nemotron's PostNAS search uses 200 billion tokens.
DASH relaxes discrete operator placement into continuous architecture logits.
DASH prepares reusable teacher-aligned linear candidates.
DASH performs architecture-only search with frozen model and operator weights.
DASH enables search in minutes on a single GPU.
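The facts on continuous architecture logits and frozen weights can be illustrated with a toy sketch. This is not DASH's objective or code: the per-layer penalties, the cost constant, the loss, and all function names below are hypothetical stand-ins chosen so the example runs with the standard library alone. Each layer chooses between a "full" and a "linear" attention operator, and only the logits are updated by gradient descent.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a short list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical frozen quantities (made-up numbers, not from the paper):
# quality penalty for using the linear candidate in each layer.
LINEAR_PENALTY = [0.9, 0.1, 0.8, 0.05, 0.2, 0.7]
# Hypothetical efficiency cost of keeping full attention in a layer.
FULL_COST = 0.5

def relaxed_loss(arch_logits):
    """Expected cost under the softmax-relaxed operator choice."""
    loss = 0.0
    for (a_full, a_lin), penalty in zip(arch_logits, LINEAR_PENALTY):
        p_full, p_lin = softmax([a_full, a_lin])
        loss += p_full * FULL_COST + p_lin * penalty
    return loss

def search(steps=200, lr=0.5):
    """Gradient descent on architecture logits only; nothing else changes."""
    logits = [[0.0, 0.0] for _ in LINEAR_PENALTY]
    for _ in range(steps):
        for layer, penalty in zip(logits, LINEAR_PENALTY):
            p_full, p_lin = softmax(layer)
            layer_loss = p_full * FULL_COST + p_lin * penalty
            # For loss = sum_k p_k * c_k, d(loss)/d(a_j) = p_j * (c_j - loss).
            grad_full = p_full * (FULL_COST - layer_loss)
            grad_lin = p_lin * (penalty - layer_loss)
            layer[0] -= lr * grad_full
            layer[1] -= lr * grad_lin
    return logits

def discretize(arch_logits):
    """Pick the operator with the larger logit in each layer."""
    return ["full" if a_full >= a_lin else "linear"
            for a_full, a_lin in arch_logits]

if __name__ == "__main__":
    found = search()
    print(discretize(found))
    print(round(relaxed_loss(found), 3))
```

With these made-up numbers, layers whose linear penalty exceeds the full-attention cost keep full attention and the rest switch to the linear candidate. A real search would derive the loss from the model's outputs against a teacher rather than from fixed constants.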

Sources

arXiv cs.AI — 2026-05-21