ARTFEED — Contemporary Art Intelligence

TCM-Serve: Modality-Aware Scheduling for Multimodal LLM Inference

ai-technology · 2026-05-07

TCM-Serve is a new system that tackles the scheduling challenges of serving multimodal large language models (MLLMs) such as ChatGPT, Gemini, and Copilot. These models process text, images, and videos, yet current LLM serving frameworks, designed for text-only workloads, suffer head-of-line blocking and degraded performance when handling mixed multimodal requests. The key observation is that such requests differ by orders of magnitude in resource demands: videos behave like trucks, images like cars, and text like motorcycles. TCM-Serve acts as a modality-aware scheduler, letting small requests (text) through quickly while keeping large requests (images and videos) from starving. The system classifies requests by modality and adjusts scheduling to preserve interactive responsiveness. The paper is available on arXiv under identifier 2603.26498.
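The core idea, small requests first but with an aging mechanism so large requests are never starved, can be sketched as a toy priority scheduler. This is an illustrative sketch only, not TCM-Serve's actual algorithm; the per-modality cost weights, the `aging_rate` parameter, and the request format are all assumptions made up for the example.

```python
# Toy modality-aware scheduler: small (text) requests are dispatched first,
# while an aging bonus keeps large (video) requests from starving.
# Cost weights are illustrative assumptions, not figures from the paper.
MODALITY_COST = {"text": 1, "image": 10, "video": 100}

class ModalityAwareScheduler:
    def __init__(self, aging_rate=0.5):
        self.aging_rate = aging_rate      # priority boost per tick of waiting
        self.queue = []                   # list of (arrival_tick, request)
        self.tick = 0

    def submit(self, request):
        """Enqueue a request dict with a 'modality' key."""
        self.queue.append((self.tick, request))

    def dispatch(self):
        """Serve the request with the lowest effective cost.
        Effective cost = modality cost minus an aging credit that grows
        with waiting time, so heavy requests eventually win."""
        self.tick += 1
        if not self.queue:
            return None

        def effective(item):
            arrival, req = item
            wait = self.tick - arrival
            return MODALITY_COST[req["modality"]] - self.aging_rate * wait

        item = min(self.queue, key=effective)
        self.queue.remove(item)
        return item[1]
```

With this sketch, a video that arrives first no longer blocks later text requests (no head-of-line blocking), yet under a constant stream of text the video's aging credit eventually makes it the cheapest item and it gets served.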

Key facts

  • TCM-Serve is a modality-aware scheduler for multimodal large language model inference.
  • Multimodal requests differ by orders of magnitude in resource demands.
  • Videos are compared to trucks, images to cars, and text to motorcycles.
  • Existing LLM serving systems optimized for text-only workloads fail under multimodality.
  • Large requests like videos cause head-of-line blocking and performance degradation.
  • TCM-Serve prioritizes small requests to ensure interactive responsiveness.
  • The system avoids starvation of larger requests.
  • The paper is published on arXiv with ID 2603.26498.

Entities

Institutions

  • arXiv

Sources