Cross-Stage Coherence in Hierarchical Driving VQA
This work studies cross-stage context transfer for Graph Visual Question Answering (GVQA) in autonomous driving, using the DriveLM-nuScenes dataset to compare two approaches. The explicit variant applies three prompt-based conditioning strategies to a domain-adapted 4B VLM (Mini-InternVL2-4B-DA-DriveLM), reducing NLI-detected contradictions by up to 42.6% without any additional training. The implicit variant introduces gated context projectors that pool hidden-state vectors from one stage and add their normalized, gated projections to the next stage's input embeddings; these projectors are trained jointly with stage-specific QLoRA adapters on a general-purpose 8B VLM (InternVL3-8B-Instruct), updating only about 0.3% of the parameters. Together, the two variants establish strong baselines for cross-stage coherence in hierarchical driving VQA.
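The implicit variant can be sketched as follows. This is a minimal, hypothetical PyTorch module, not the paper's released code: the pooling choice (mean over tokens), the scalar tanh gate, and all dimensions are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class GatedContextProjector(nn.Module):
    """Hypothetical sketch of a gated context projector: pools the
    previous stage's hidden states, projects them into the next
    stage's embedding space, normalizes, and injects the result
    through a learned gate. Details are assumptions, not the
    paper's exact specification."""

    def __init__(self, src_dim: int, tgt_dim: int):
        super().__init__()
        self.proj = nn.Linear(src_dim, tgt_dim)
        self.norm = nn.LayerNorm(tgt_dim)
        # Scalar gate initialized at zero so training starts from
        # the unconditioned baseline (tanh(0) = 0 disables injection).
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, prev_hidden: torch.Tensor,
                next_embeds: torch.Tensor) -> torch.Tensor:
        # prev_hidden: (batch, seq_prev, src_dim), prior-stage states
        # next_embeds: (batch, seq_next, tgt_dim), next-stage inputs
        ctx = prev_hidden.mean(dim=1)        # pool over tokens
        ctx = self.norm(self.proj(ctx))      # project + normalize
        gated = torch.tanh(self.gate) * ctx  # learned gate
        return next_embeds + gated.unsqueeze(1)  # broadcast-add
```

A zero-initialized gate is one common way such conditioning modules avoid disrupting the pretrained model early in training; the paper's actual gating form may differ.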
Key facts
- Study on cross-stage context passing for GVQA in autonomous driving
- Uses DriveLM-nuScenes dataset
- Explicit variant uses prompt-based conditioning on Mini-InternVL2-4B-DA-DriveLM
- Reduces NLI-detected contradictions by up to 42.6% without additional training
- Implicit variant introduces gated context projectors
- Projectors inject normalized, gated projections into next-stage input embeddings
- Jointly trained with stage-specific QLoRA adapters on InternVL3-8B-Instruct
- Updates only approximately 0.3% of parameters
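The NLI-contradiction metric referenced above can be expressed as a simple rate over cross-stage answer pairs. The sketch below is illustrative: the `nli_predict` interface and the toy keyword-based predictor are hypothetical stand-ins for a real NLI model, not the paper's evaluation code.

```python
def contradiction_rate(stage_pairs, nli_predict):
    """Fraction of (previous-stage answer, next-stage answer) pairs
    that an NLI model labels 'contradiction'. `nli_predict` is any
    callable returning one of {'entailment', 'neutral',
    'contradiction'} -- a hypothetical interface for illustration."""
    if not stage_pairs:
        return 0.0
    hits = sum(1 for prev, nxt in stage_pairs
               if nli_predict(prev, nxt) == "contradiction")
    return hits / len(stage_pairs)

# Toy predictor: flags a pair when exactly one answer is negated and
# both end on the same word. A real setup would use a trained NLI model.
def toy_nli(premise, hypothesis):
    negated = ("not " in premise) != ("not " in hypothesis)
    same_topic = premise.split()[-1] == hypothesis.split()[-1]
    return "contradiction" if (negated and same_topic) else "neutral"

pairs = [("The car is not moving", "The car is moving"),
         ("The light is red", "The ego vehicle should stop")]
print(contradiction_rate(pairs, toy_nli))  # → 0.5
```

The paper's reported "up to 42.6%" figure refers to the relative reduction of such a contradiction rate under prompt-based conditioning.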
Entities
Institutions
- arXiv