DMI-Lib: A High-Speed Deep Model Inspector for LLM Inference
DMI-Lib is a high-speed deep model inspector that treats internal observability as a first-class systems primitive for LLM inference. It decouples observability from the inference hot path through an asynchronous substrate built on Ring^2, a GPU-CPU memory abstraction for capturing and staging tensors, together with a policy-controlled host backend for export. This design allows observation points to be placed across a wide range of internal signals and inference backends while preserving serving optimizations and respecting strict GPU memory budgets. Evaluation shows that DMI-Lib incurs only 0.4%–6.8% overhead in offline batch inference and averages 6% under moderate online serving, reducing latency overhead by 2x–15x relative to existing baselines. The library is open-sourced at https://github.com.
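The decoupling idea can be illustrated with a small sketch. The code below is a hypothetical stand-in for a Ring^2-style pipeline, not DMI-Lib's actual API: a small bounded "GPU" ring accepts captured tensors without ever blocking the inference hot path (dropping on overflow to respect a fixed memory budget), and a background thread stages records through a larger "CPU" ring into an export callback. All class and parameter names are illustrative assumptions.

```python
import threading
import time
from collections import deque

class RingStage:
    """One bounded ring; a full ring drops new records so the
    producer (the inference loop) never blocks on observability."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.buf = deque()
        self.lock = threading.Lock()

    def try_put(self, item):
        with self.lock:
            if len(self.buf) >= self.capacity:
                return False  # drop rather than stall the hot path
            self.buf.append(item)
            return True

    def try_get(self):
        with self.lock:
            return self.buf.popleft() if self.buf else None

class DoubleRing:
    """Hypothetical Ring^2-style staging: a small 'GPU' ring drained
    by a background copier thread into a larger 'CPU' ring, whose
    contents are handed to an export callback off the critical path."""
    def __init__(self, gpu_slots, cpu_slots, export):
        self.gpu = RingStage(gpu_slots)
        self.cpu = RingStage(cpu_slots)
        self.export = export
        self.done = threading.Event()
        self.copier = threading.Thread(target=self._drain, daemon=True)
        self.copier.start()

    def capture(self, name, tensor):
        # Called from the inference hot path: O(1) and non-blocking.
        return self.gpu.try_put((name, tensor))

    def _drain(self):
        # Background thread: stage GPU-side records to the host ring,
        # then export them (standing in for async D2H copy + I/O).
        while not self.done.is_set() or self.gpu.buf:
            item = self.gpu.try_get()
            if item is None:
                time.sleep(0.001)
                continue
            if self.cpu.try_put(item):
                out = self.cpu.try_get()
                if out is not None:
                    self.export(out)

    def close(self):
        self.done.set()
        self.copier.join()
```

A producer that captures faster than the exporter can drain loses the newest records instead of slowing inference, which is the trade-off the memory-budget constraint implies.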
Key facts
- DMI-Lib is a high-speed deep model inspector for LLM inference.
- It treats internal observability as a first-class systems primitive.
- It decouples observability from the inference hot path via an asynchronous substrate.
- The substrate is built from Ring^2, a GPU-CPU memory abstraction.
- It uses a policy-controlled host backend to export tensors.
- DMI-Lib enables observation points across internal signals and inference backends.
- It preserves serving optimizations and adheres to GPU memory budgets.
- Overhead is 0.4%–6.8% in offline batch inference and averages 6% in moderate online serving.
- Latency overhead is reduced by 2x–15x compared to baselines.
- DMI-Lib is open-sourced at https://github.com.
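The policy-controlled placement of observation points mentioned above can be sketched as follows. This is a minimal illustration under assumed semantics, not DMI-Lib's real interface: a policy admits a captured signal based on its name (glob patterns) and a sampling stride over decoding steps, so the hot path only pays a cheap predicate while export work is deferred to a sink.

```python
import fnmatch

class ObservationPolicy:
    """Hypothetical policy filter: decides, per step and per signal
    name, whether a captured tensor should be exported. The pattern
    and stride semantics here are illustrative assumptions."""
    def __init__(self, patterns, every_n_steps=1):
        self.patterns = patterns
        self.every_n = every_n_steps

    def admits(self, name, step):
        if step % self.every_n != 0:
            return False
        return any(fnmatch.fnmatch(name, p) for p in self.patterns)

class Inspector:
    """Routes admitted observations to a sink, e.g. an asynchronous
    export queue; a plain list stands in for the sink here."""
    def __init__(self, policy, sink):
        self.policy = policy
        self.sink = sink

    def observe(self, name, step, tensor):
        # Observation point: cheap predicate inline, heavy work deferred.
        if self.policy.admits(name, step):
            self.sink.append((step, name, tensor))
```

For example, a policy of `["*.attn_scores"]` with `every_n_steps=2` would export attention scores only on even decoding steps and ignore all other signals.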