Forge-UGC Compiler Framework Optimizes Transformer Deployment on Heterogeneous Hardware
Forge-UGC has unveiled a four-phase compiler framework for deploying transformer models on heterogeneous accelerator hardware, validated on the Intel AI Boost NPU. The system targets shortcomings of existing frameworks such as OpenVINO and ONNX Runtime, whose opaque compilation pipelines, limited pass-level visibility, and inefficient buffer management can inflate compilation cost and runtime latency.

Its hardware-agnostic design separates graph capture, optimization, intermediate representation (IR) lowering, and backend scheduling into distinct phases. Phase 1 captures computational graphs with torch.export at the ATen operator level, handling modern transformer components such as rotary position embeddings and grouped-query attention without manual decomposition. Phase 2 applies six core optimization passes, including dead code elimination and attention fusion.

This universal graph compilation strategy aims to improve performance and reduce overhead for contemporary AI models. The work is documented in arXiv preprint 2604.16498v1, a cross-listed submission. By making the compilation process more transparent and controllable, Forge-UGC aims to improve efficiency for developers running transformer architectures on specialized hardware.
Key facts
- Forge-UGC is a four-phase compiler for transformer deployment on heterogeneous accelerator hardware.
- It was validated on Intel AI Boost NPU.
- It addresses limitations in existing frameworks like OpenVINO and ONNX Runtime.
- The compiler uses a hardware-agnostic design separating graph capture, optimization, IR lowering, and backend scheduling.
- Phase 1 captures graphs with torch.export at the ATen operator level.
- Phase 1 supports transformer components like rotary position embeddings, grouped-query attention, and SwiGLU without manual decomposition.
- Phase 2 applies six optimization passes including dead code elimination and attention fusion.
- The research is documented in arXiv preprint 2604.16498v1.
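One of the Phase 2 passes named above, dead code elimination, can be sketched on a torch.fx graph using the generic FX utility; this is a stand-in for whatever pass implementation the paper uses, and the traced function is a made-up example.

```python
import torch
import torch.fx

def model(x):
    unused = x * 2          # dead: this result never reaches the output
    return torch.relu(x)

# Trace the function into an FX graph, then remove nodes with no users.
gm = torch.fx.symbolic_trace(model)
before = len(gm.graph.nodes)
gm.graph.eliminate_dead_code()  # drops the unused multiply
gm.recompile()
after = len(gm.graph.nodes)

print(f"nodes before: {before}, after: {after}")
```

Attention fusion, the other pass called out in the summary, would similarly rewrite the graph, replacing a matched softmax-matmul subgraph with a single fused operator, but its exact pattern set is not described in this summary.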
Entities
Institutions
- Intel