DMI-Lib: A High-Speed Deep Model Inspector for LLM Inference
DMI-Lib is a high-speed deep model inspector that treats internal observability as a first-class systems primitive for LLM inference. It decouples observability from the inference hot path through an asynchronous substrate built on Ring^2, a GPU-CPU memory abstraction for capturing and staging tensors, together with a policy-controlled host backend for export. This design allows observation points to be placed across a wide range of internal signals and inference backends while preserving serving optimizations and respecting strict GPU memory budgets. Evaluation shows that DMI-Lib incurs only 0.4%–6.8% overhead in offline batch inference and averages 6% under moderate online serving, reducing latency overhead by 2x–15x relative to existing baselines. The library is open-sourced at https://github.com.
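The decoupling idea can be illustrated with a small sketch. The code below is a hypothetical stand-in for a Ring^2-style pipeline, not DMI-Lib's actual API: a small bounded "GPU" ring accepts captured tensors without ever blocking the inference hot path (dropping on overflow to respect a fixed memory budget), and a background thread stages records through a larger "CPU" ring into an export callback. All class and parameter names are illustrative assumptions.

```python
import threading
import time
from collections import deque

class RingStage:
    """One bounded ring; a full ring drops new records so the
    producer (the inference loop) never blocks on observability."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.buf = deque()
        self.lock = threading.Lock()

    def try_put(self, item):
        with self.lock:
            if len(self.buf) >= self.capacity:
                return False  # drop rather than stall the hot path
            self.buf.append(item)
            return True

    def try_get(self):
        with self.lock:
            return self.buf.popleft() if self.buf else None

class DoubleRing:
    """Hypothetical Ring^2-style staging: a small 'GPU' ring drained
    by a background copier thread into a larger 'CPU' ring, whose
    contents are handed to an export callback off the critical path."""
    def __init__(self, gpu_slots, cpu_slots, export):
        self.gpu = RingStage(gpu_slots)
        self.cpu = RingStage(cpu_slots)
        self.export = export
        self.done = threading.Event()
        self.copier = threading.Thread(target=self._drain, daemon=True)
        self.copier.start()

    def capture(self, name, tensor):
        # Called from the inference hot path: O(1) and non-blocking.
        return self.gpu.try_put((name, tensor))

    def _drain(self):
        # Background thread: stage GPU-side records to the host ring,
        # then export them (standing in for async D2H copy + I/O).
        while not self.done.is_set() or self.gpu.buf:
            item = self.gpu.try_get()
            if item is None:
                time.sleep(0.001)
                continue
            if self.cpu.try_put(item):
                out = self.cpu.try_get()
                if out is not None:
                    self.export(out)

    def close(self):
        self.done.set()
        self.copier.join()
```

A producer that captures faster than the exporter can drain loses the newest records instead of slowing inference, which is the trade-off the memory-budget constraint implies.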
Key facts
- DMI-Lib is a high-speed deep model inspector for LLM inference.
- It treats internal observability as a first-class systems primitive.
- It decouples observability from the inference hot path via an asynchronous substrate.
- The substrate is built from Ring^2, a GPU-CPU memory abstraction.
- It uses a policy-controlled host backend to export tensors.
- DMI-Lib enables observation points across internal signals and inference backends.
- It preserves serving optimizations and adheres to GPU memory budgets.
- Overhead is 0.4%–6.8% in offline batch inference and averages 6% in moderate online serving.
- Latency overhead is reduced by 2x–15x compared to baselines.
- DMI-Lib is open-sourced at https://github.com.
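The policy-controlled placement of observation points mentioned above can be sketched as follows. This is a minimal illustration under assumed semantics, not DMI-Lib's real interface: a policy admits a captured signal based on its name (glob patterns) and a sampling stride over decoding steps, so the hot path only pays a cheap predicate while export work is deferred to a sink.

```python
import fnmatch

class ObservationPolicy:
    """Hypothetical policy filter: decides, per step and per signal
    name, whether a captured tensor should be exported. The pattern
    and stride semantics here are illustrative assumptions."""
    def __init__(self, patterns, every_n_steps=1):
        self.patterns = patterns
        self.every_n = every_n_steps

    def admits(self, name, step):
        if step % self.every_n != 0:
            return False
        return any(fnmatch.fnmatch(name, p) for p in self.patterns)

class Inspector:
    """Routes admitted observations to a sink, e.g. an asynchronous
    export queue; a plain list stands in for the sink here."""
    def __init__(self, policy, sink):
        self.policy = policy
        self.sink = sink

    def observe(self, name, step, tensor):
        # Observation point: cheap predicate inline, heavy work deferred.
        if self.policy.admits(name, step):
            self.sink.append((step, name, tensor))
```

For example, a policy of `["*.attn_scores"]` with `every_n_steps=2` would export attention scores only on even decoding steps and ignore all other signals.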