LatentRouter Predicts Multimodal Model Performance Before Execution
Researchers propose LatentRouter, a system that predicts how well a multimodal large language model (MLLM) will perform on a given image-question input before actually running the model. The router extracts multimodal routing capsules from the query and compares them with model capability tokens via latent communication, estimating counterfactual utility for each candidate MLLM. A distributional outcome head predicts model-specific quality, while a bounded capsule correction refines close decisions. The approach addresses the heterogeneous strengths of MLLMs across tasks like OCR, chart understanding, spatial reasoning, and visual question answering, aiming to optimize for both performance and cost/latency trade-offs. The paper is published on arXiv under ID 2605.11301.
Key facts
- LatentRouter formulates MLLM routing as counterfactual multimodal utility prediction.
- It extracts learned multimodal routing capsules from image-question queries.
- Each candidate MLLM is represented by a model capability token.
- Latent communication estimates how each model would perform if selected.
- A distributional outcome head predicts model-specific counterfactual quality.
- A bounded capsule correction refines close decisions while preventing the correction term from dominating the base prediction.
- MLLMs have heterogeneous strengths across OCR, chart understanding, spatial reasoning, VQA, cost, and latency.
- The paper is arXiv:2605.11301.
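The routing pipeline described above can be sketched in simplified form. This is a minimal illustrative implementation, not the paper's method: the function names, the dot-product stand-in for latent communication, the sigmoid stand-in for the distributional outcome head, and the `lam` and `corr_bound` parameters are all assumptions for the sake of the example.

```python
import numpy as np

def route(query_capsule, capability_tokens, costs, lam=0.1, corr_bound=0.05):
    """Pick the candidate MLLM with the highest predicted utility.

    query_capsule:     (d,)   routing capsule extracted from the image-question pair
    capability_tokens: (M, d) one learned capability token per candidate MLLM
    costs:             (M,)   normalized cost/latency per model
    lam:               weight of the cost penalty (hypothetical knob)
    corr_bound:        cap on the capsule correction so it cannot dominate
    """
    # Latent communication (simplified here to a dot product): compare the
    # query capsule against each model's capability token.
    scores = capability_tokens @ query_capsule            # (M,)
    # Distributional outcome head (stand-in): map scores to a predicted
    # per-model answer quality in [0, 1].
    quality = 1.0 / (1.0 + np.exp(-scores))
    # Bounded capsule correction: a small, clipped refinement for close calls.
    correction = np.clip(0.01 * scores, -corr_bound, corr_bound)
    # Counterfactual utility: predicted quality, refined, minus a cost penalty.
    utility = quality + correction - lam * costs
    return int(np.argmax(utility)), utility

# Hypothetical usage: 3 candidate models, 8-dim capsules.
rng = np.random.default_rng(0)
capsule = rng.normal(size=8)
tokens = rng.normal(size=(3, 8))
costs = np.array([0.2, 0.5, 0.1])
best, utilities = route(capsule, tokens, costs)
```

The key property mirrored here is that routing happens before any candidate model runs: only the capsule, the capability tokens, and the cost estimates are consulted, and the correction term is explicitly clipped so it can only nudge close decisions.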
Entities
Institutions
- arXiv