TRINE FPGA Accelerator Outperforms RTX 4090 in Multimodal AI Inference

ai-technology · 2026-06-01

TRINE is a single-bitstream FPGA accelerator and compiler that runs end-to-end multimodal inference without reconfiguration. It expresses layers as DDMM, SDDMM, or SpMM kernels and maps them to a mode-switchable engine that changes at runtime between weight/output-stationary systolic, 1xCS SIMD, and routable adder-tree modes on a shared PE array. A width-matched, two-stage top-k unit performs in-stream token pruning, and dependency-aware layer offloading (DALO) overlaps independent kernels across reconfigurable processing units. Evaluated on the Alveo U50 and ZCU104, TRINE cuts latency by up to 22.57x relative to the RTX 4090 and 6.86x relative to the Jetson Orin Nano at 20-21 W; token pruning alone yields up to a 7.8x speedup on ViT-heavy pipelines.
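
For readers unfamiliar with the three kernel classes, the abbreviations are commonly read as dense-dense matrix multiplication (DDMM), sampled dense-dense matrix multiplication (SDDMM), and sparse-dense matrix multiplication (SpMM). The sketch below shows only their reference semantics in plain Python. It is not TRINE's hardware implementation, and every function name in it is hypothetical.

```python
# Illustrative reference semantics only. Function names are hypothetical
# and do not come from the TRINE compiler or paper.

def ddmm(a, b):
    """Dense x dense: C[i][j] = sum over k of A[i][k] * B[k][j]."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]

def sddmm(a, b, sample):
    """Sampled dense x dense: compute only the (i, j) positions in `sample`.

    Returns {(i, j): value}; outputs outside `sample` are never computed.
    """
    inner = len(b)
    return {(i, j): sum(a[i][k] * b[k][j] for k in range(inner))
            for (i, j) in sample}

def spmm(sparse_a, b, rows):
    """Sparse x dense: `sparse_a` is {(i, k): value} holding nonzeros only."""
    cols = len(b[0])
    out = [[0] * cols for _ in range(rows)]
    for (i, k), v in sparse_a.items():
        for j in range(cols):
            out[i][j] += v * b[k][j]
    return out

a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
dense = ddmm(a, b)                                # [[19, 22], [43, 50]]
sampled = sddmm(a, b, [(0, 0), (1, 1)])           # {(0, 0): 19, (1, 1): 50}
sparse = spmm({(0, 1): 2, (1, 0): 3}, b, rows=2)  # [[14, 16], [15, 18]]
```

All three share the same multiply-accumulate inner loop and differ only in which operands or outputs are skipped, which is one plausible reason a single shared PE array can serve them.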

Key facts

TRINE is a single-bitstream FPGA accelerator and compiler for multimodal inference.
It executes end-to-end multimodal inference without reconfiguration.
Layers are unified as DDMM, SDDMM, or SpMM kernels.
The engine toggles at runtime among three modes: weight/output-stationary systolic, 1xCS SIMD, and routable adder tree.
A two-stage top-k unit enables in-stream token pruning.
Dependency-aware layer offloading (DALO) overlaps independent kernels.
Evaluated on Alveo U50 and ZCU104 FPGAs.
TRINE reduces latency by up to 22.57x vs. RTX 4090 and 6.86x vs. Jetson Orin Nano at 20-21 W.
Token pruning yields up to 7.8x speedup on ViT-heavy pipelines.
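
To make the token-pruning idea concrete, here is a minimal software sketch of a two-stage top-k selection. The staging is an assumption made for illustration: stage 1 keeps the local top-k of each fixed-width chunk of the score stream, and stage 2 picks the global top-k from those candidates. The summary does not describe how TRINE's unit is actually organized, and the names `two_stage_topk`, `prune_tokens`, and `width` are hypothetical.

```python
import heapq

def two_stage_topk(scores, k, width):
    """Return the indices of the k highest-scoring tokens, in original order.

    Assumed structure (not taken from the TRINE paper): stage 1 keeps the
    local top-k of each `width`-sized chunk; stage 2 selects the global
    top-k from the surviving candidates.
    """
    candidates = []
    for start in range(0, len(scores), width):
        chunk = [(s, start + off)
                 for off, s in enumerate(scores[start:start + width])]
        # Stage 1: local top-k within this chunk.
        candidates.extend(heapq.nlargest(k, chunk, key=lambda t: t[0]))
    # Stage 2: global top-k over all chunk winners.
    winners = heapq.nlargest(k, candidates, key=lambda t: t[0])
    return sorted(idx for _, idx in winners)

def prune_tokens(tokens, scores, k, width=4):
    """Keep only the tokens selected by the two-stage top-k."""
    return [tokens[i] for i in two_stage_topk(scores, k, width)]

scores = [0.1, 0.9, 0.3, 0.7, 0.2, 0.8, 0.05, 0.6]
tokens = ["t0", "t1", "t2", "t3", "t4", "t5", "t6", "t7"]
kept = prune_tokens(tokens, scores, k=3)  # ["t1", "t3", "t5"]
```

Because every chunk forwards its own k best entries, the global top-k is always among the stage-2 candidates, so the result matches a single full sort while each stage only ever compares a bounded number of values.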

Sources

arXiv cs.AI — 2026-06-01