VISD: Structured Self-Distillation for Video Reasoning
VISD (Video Structured self-Distillation) is a new approach for improving complex reasoning in Video Large Language Models (VideoLLMs). It addresses the difficulty of training these models on tasks that demand temporal grounding and logical coherence. Reinforcement learning with verifiable rewards (RLVR) offers dependable supervision but cannot assign token-level credit, which limits learning; existing self-distillation methods deliver dense supervision but lack structure and diagnostic specificity, making their interaction with reinforcement learning unstable. VISD introduces a video-aware judge model that decomposes reasoning quality into dimensions such as answer correctness, logical consistency, and spatio-temporal grounding, and uses these decomposed signals to guide a teacher policy in assigning token-level credit. The goal is more efficient and effective training of VideoLLMs on complex reasoning tasks. The paper is on arXiv under identifier 2605.06094.
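To make the decomposition concrete, here is a minimal, hypothetical sketch of the idea: a judge scores a reasoning trace along the three named dimensions, the per-dimension scores are collapsed into one scalar reward, and that reward is spread over tokens. All names, weights, and the credit rule below are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of VISD-style structured judging. The dimension names
# come from the paper's description; the weights and the token-credit rule
# are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class JudgeScores:
    answer_correctness: float        # final answer matches? 0 or 1
    logical_consistency: float       # coherence of the reasoning chain, in [0, 1]
    spatiotemporal_grounding: float  # grounding in video frames/timestamps, in [0, 1]


def aggregate(scores: JudgeScores, weights=(0.5, 0.25, 0.25)) -> float:
    """Collapse per-dimension judge scores into one scalar reward."""
    dims = (scores.answer_correctness,
            scores.logical_consistency,
            scores.spatiotemporal_grounding)
    return sum(w * d for w, d in zip(weights, dims))


def token_credit(token_logprobs: list[float], reward: float) -> list[float]:
    """Toy token-level credit: distribute the scalar reward across tokens,
    weighting low-confidence tokens (large negative log-prob) more heavily.
    This is an assumed rule, not the paper's assignment mechanism."""
    total = sum(-lp for lp in token_logprobs) or 1.0
    return [reward * (-lp) / total for lp in token_logprobs]


scores = JudgeScores(answer_correctness=1.0,
                     logical_consistency=0.8,
                     spatiotemporal_grounding=0.6)
reward = aggregate(scores)                        # 0.5*1.0 + 0.25*0.8 + 0.25*0.6 = 0.85
credits = token_credit([-0.1, -2.0, -0.5], reward)
```

The point of the decomposition is diagnostic: a trace that names the right answer but cites the wrong frames scores high on `answer_correctness` and low on `spatiotemporal_grounding`, giving the teacher policy a structured signal rather than a single opaque reward.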
Key facts
- VISD stands for Video Structured self-Distillation.
- It is designed for training VideoLLMs in complex reasoning.
- RLVR provides reliable supervision but lacks token-level credit assignment.
- Existing self-distillation methods lack structure and diagnostic specificity.
- VISD uses a video-aware judge model to decompose reasoning quality.
- Dimensions include answer correctness, logical consistency, and spatio-temporal grounding.
- The framework guides a teacher policy for token-level credit assignment.
- The paper is on arXiv with ID 2605.06094.