ARTFEED — Contemporary Art Intelligence

N-vium: Mixture-of-Exits Transformer Boosts Inference Speed

other · 2026-05-14

The recently introduced transformer architecture N-vium (arXiv:2605.13190) speeds up autoregressive generation by partially parallelizing computation across depth on standard hardware. Unlike techniques that trade quality for fewer FLOPs per token, N-vium instead raises effective FLOPs per second through a mixture-of-exits design: prediction heads are attached at multiple depths, and the next-token distribution is formulated as a learned mixture over those heads with token-adaptive routing. The formulation strictly generalizes the standard transformer, which is recovered when routing assigns zero mass to the intermediate heads. Sampling from the mixture is exact, and KV caches are reconciled by deferring upper-layer computation, so the largest pretrained model, at 1.5B parameters, achieves a 57.9% wall-clock speedup over a parameter-matched baseline.
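
To make the mixture-of-exits formulation concrete, here is a minimal sketch (an assumed structure, not the paper's code) of such an output head: unembedding heads at several depths, a token-adaptive router over the exits, and a mixture that collapses to the ordinary final-layer softmax when the router places all mass on the deepest head. The class name, router design, and choice of exit layers are illustrative assumptions.

import torch
import torch.nn.functional as F

class MixtureOfExitsHead(torch.nn.Module):
    """Hypothetical mixture-of-exits output head (illustrative, not N-vium's code)."""

    def __init__(self, d_model, vocab_size, exit_layers):
        super().__init__()
        self.exit_layers = list(exit_layers)
        # One prediction (unembedding) head per chosen depth.
        self.heads = torch.nn.ModuleList(
            torch.nn.Linear(d_model, vocab_size) for _ in self.exit_layers
        )
        # Token-adaptive router: maps a hidden state to mixture weights over
        # all exits (routing from the shallowest exit is an assumed choice).
        self.router = torch.nn.Linear(d_model, len(self.exit_layers))

    def forward(self, hidden_states):
        # hidden_states: dict mapping layer index -> (batch, d_model) hidden
        # state of the current position.
        h_shallow = hidden_states[self.exit_layers[0]]
        weights = F.softmax(self.router(h_shallow), dim=-1)           # (batch, n_exits)
        per_exit = torch.stack(
            [F.softmax(head(hidden_states[layer]), dim=-1)            # (batch, vocab)
             for head, layer in zip(self.heads, self.exit_layers)],
            dim=1,
        )                                                             # (batch, n_exits, vocab)
        # Learned mixture over exits; if all weight lands on the deepest exit,
        # this reduces to the standard final-layer softmax of a plain transformer.
        return torch.einsum("be,bev->bv", weights, per_exit)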

Key facts

  • N-vium is a mixture-of-exits transformer for accelerated exact generation.
  • It partially parallelizes computation across depth on standard hardware.
  • It increases effective FLOPs per second rather than minimizing compute per token.
  • Prediction heads are attached at multiple depths.
  • Next-token distribution is a learned mixture with token-adaptive routing.
  • The formulation strictly generalizes the standard transformer.
  • Sampling from the mixture is exact (a toy illustration follows this list).
  • The largest model reaches 57.9% wall-clock speedup at 1.5B parameters.
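
The exactness of sampling from such a mixture can be seen with a toy example (assumed numbers, unrelated to the paper): drawing an exit according to the router weights and then a token from that exit's distribution has the same marginal as sampling directly from the weighted mixture.

import torch

w = torch.tensor([0.3, 0.7])                  # router weights over two exits (toy values)
p = torch.tensor([[0.6, 0.4],                 # per-exit next-token distributions
                  [0.1, 0.9]])
mixture = w @ p                               # marginal next-token distribution: [0.25, 0.75]
# Ancestral sampling: pick an exit, then a token from that exit's distribution.
exit_idx = torch.multinomial(w, 1).item()
token = torch.multinomial(p[exit_idx], 1).item()
# Sampling directly from the mixture yields the same distribution over tokens.
token_direct = torch.multinomial(mixture, 1).item()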
