Cross-Stage Coherence in Hierarchical Driving VQA
This work studies cross-stage context transfer for Graph Visual Question Answering (GVQA) in autonomous driving, using the DriveLM-nuScenes dataset to compare two approaches. The explicit variant applies three prompt-based conditioning strategies to a domain-adapted 4B VLM (Mini-InternVL2-4B-DA-DriveLM), reducing NLI-detected contradictions by up to 42.6% without any additional training. The implicit variant introduces gated context projectors that pool hidden-state vectors from one stage and add their normalized, gated projections to the next stage's input embeddings; these projectors are trained jointly with stage-specific QLoRA adapters on a general-purpose 8B VLM (InternVL3-8B-Instruct), updating only about 0.3% of the parameters. Together, the two variants establish strong baselines for cross-stage coherence in hierarchical driving VQA.
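The implicit variant can be sketched as follows. This is a minimal, hypothetical PyTorch module, not the paper's released code: the pooling choice (mean over tokens), the scalar tanh gate, and all dimensions are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class GatedContextProjector(nn.Module):
    """Hypothetical sketch of a gated context projector: pools the
    previous stage's hidden states, projects them into the next
    stage's embedding space, normalizes, and injects the result
    through a learned gate. Details are assumptions, not the
    paper's exact specification."""

    def __init__(self, src_dim: int, tgt_dim: int):
        super().__init__()
        self.proj = nn.Linear(src_dim, tgt_dim)
        self.norm = nn.LayerNorm(tgt_dim)
        # Scalar gate initialized at zero so training starts from
        # the unconditioned baseline (tanh(0) = 0 disables injection).
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, prev_hidden: torch.Tensor,
                next_embeds: torch.Tensor) -> torch.Tensor:
        # prev_hidden: (batch, seq_prev, src_dim), prior-stage states
        # next_embeds: (batch, seq_next, tgt_dim), next-stage inputs
        ctx = prev_hidden.mean(dim=1)        # pool over tokens
        ctx = self.norm(self.proj(ctx))      # project + normalize
        gated = torch.tanh(self.gate) * ctx  # learned gate
        return next_embeds + gated.unsqueeze(1)  # broadcast-add
```

A zero-initialized gate is one common way such conditioning modules avoid disrupting the pretrained model early in training; the paper's actual gating form may differ.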
Key facts
- Study on cross-stage context passing for GVQA in autonomous driving
- Uses DriveLM-nuScenes dataset
- Explicit variant uses prompt-based conditioning on Mini-InternVL2-4B-DA-DriveLM
- Reduces NLI-detected contradictions by up to 42.6% without additional training
- Implicit variant introduces gated context projectors
- Projectors inject normalized, gated projections into next-stage input embeddings
- Jointly trained with stage-specific QLoRA adapters on InternVL3-8B-Instruct
- Updates only approximately 0.3% of parameters
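The NLI-contradiction metric referenced above can be expressed as a simple rate over cross-stage answer pairs. The sketch below is illustrative: the `nli_predict` interface and the toy keyword-based predictor are hypothetical stand-ins for a real NLI model, not the paper's evaluation code.

```python
def contradiction_rate(stage_pairs, nli_predict):
    """Fraction of (previous-stage answer, next-stage answer) pairs
    that an NLI model labels 'contradiction'. `nli_predict` is any
    callable returning one of {'entailment', 'neutral',
    'contradiction'} -- a hypothetical interface for illustration."""
    if not stage_pairs:
        return 0.0
    hits = sum(1 for prev, nxt in stage_pairs
               if nli_predict(prev, nxt) == "contradiction")
    return hits / len(stage_pairs)

# Toy predictor: flags a pair when exactly one answer is negated and
# both end on the same word. A real setup would use a trained NLI model.
def toy_nli(premise, hypothesis):
    negated = ("not " in premise) != ("not " in hypothesis)
    same_topic = premise.split()[-1] == hypothesis.split()[-1]
    return "contradiction" if (negated and same_topic) else "neutral"

pairs = [("The car is not moving", "The car is moving"),
         ("The light is red", "The ego vehicle should stop")]
print(contradiction_rate(pairs, toy_nli))  # → 0.5
```

The paper's reported "up to 42.6%" figure refers to the relative reduction of such a contradiction rate under prompt-based conditioning.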
Entities
Institutions
- arXiv