CloudWeb Attack Hijacks Remote Sensing Vision-Language RAG via Atmospheric Patterns
CloudWeb is a novel adversarial attack on the retrieval stage of vision-language multimodal retrieval-augmented generation (RAG) systems for remote sensing. Unlike prior attacks that manipulate the retrieval corpus or end-task predictions, CloudWeb modifies only the input image by overlaying parameterized cloud- and haze-like patterns. The patterns are optimized to pull the adversarial image embedding toward target atmospheric evidence, suppress source-scene evidence, and enforce rank separation, with regularizers on naturalness and coverage. Because the retriever, generator, and knowledge base stay fixed at deployment, the attack shows that evidence retrieval can be compromised from the input space alone. The work highlights the need for robust defenses in remote sensing multimodal RAG systems.
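The objective described above can be illustrated with a minimal sketch. This is not the paper's formulation: the function name `hijack_loss`, the use of cosine similarity, the hinge form of the rank term, the weights, and the coverage budget are all assumptions chosen for illustration, and the naturalness regularizer is omitted because the summary does not say how it is defined.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def hijack_loss(adv_emb, target_embs, source_embs, alpha_mask,
                margin=0.1, max_coverage=0.3,
                weights=(1.0, 1.0, 1.0, 0.1)):
    """Hypothetical composite loss; lower means a more successful hijack.

    adv_emb      -- embedding of the perturbed image
    target_embs  -- embeddings of target (atmospheric) evidence
    source_embs  -- embeddings of source-scene evidence
    alpha_mask   -- flat list of overlay opacities in [0, 1]
    """
    w_pull, w_push, w_rank, w_cov = weights
    target_sims = [cosine(adv_emb, t) for t in target_embs]
    source_sims = [cosine(adv_emb, s) for s in source_embs]

    # Pull: reward high similarity to target evidence.
    pull = 1.0 - sum(target_sims) / len(target_sims)
    # Push: penalize remaining similarity to source evidence.
    push = sum(source_sims) / len(source_sims)
    # Rank separation (assumed hinge): the weakest target should outscore
    # the strongest source item by at least `margin`.
    rank = max(0.0, margin + max(source_sims) - min(target_sims))
    # Coverage (assumed): penalize overlays that cover too much of the image.
    coverage = sum(alpha_mask) / len(alpha_mask)
    cov_penalty = max(0.0, coverage - max_coverage)

    return w_pull * pull + w_push * push + w_rank * rank + w_cov * cov_penalty
```

In an actual attack the embeddings would come from the fixed retriever's image and text encoders, and the loss would be minimized with respect to the overlay parameters only, since the retriever, generator, and knowledge base are not modified.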
Key facts
- CloudWeb is an atmospheric retrieval hijacking attack for remote sensing multimodal RAG.
- It modifies only the input image, keeping retriever, generator, and knowledge base fixed.
- The attack overlays cloud- and haze-like patterns on remote sensing images.
- Optimization objectives include pulling embeddings toward target evidence and suppressing source evidence.
- Existing adversarial studies mainly target the retrieval corpus or end-task predictions.
- Input-space threats to evidence retrieval in remote sensing RAG are underexplored.
- The attack is introduced in arXiv paper 2605.07273.
- CloudWeb exposes vulnerabilities in vision-language retrievers for remote sensing.
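The overlay step listed in the key facts can be sketched as simple alpha compositing. The summary does not specify how CloudWeb parameterizes its patterns, so the Gaussian-blob parameterization below, the function names `cloud_alpha` and `overlay_haze`, and the haze brightness value are all hypothetical and serve only to show what "parameterized cloud- and haze-like patterns" could mean in practice.

```python
import math


def cloud_alpha(height, width, blobs):
    """Build an opacity map from soft Gaussian blobs (assumed parameterization).

    blobs -- list of (center_y, center_x, radius, opacity) tuples; these are
             the parameters an optimizer would adjust in this sketch.
    """
    alpha = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            total = 0.0
            for cy, cx, radius, opacity in blobs:
                dist_sq = (y - cy) ** 2 + (x - cx) ** 2
                total += opacity * math.exp(-dist_sq / (2.0 * radius ** 2))
            alpha[y][x] = min(1.0, total)  # clamp overlapping blobs
    return alpha


def overlay_haze(image, alpha, haze_value=0.9):
    """Blend a bright haze layer over a grayscale image with values in [0, 1]."""
    return [
        [(1.0 - a) * pixel + a * haze_value for pixel, a in zip(img_row, a_row)]
        for img_row, a_row in zip(image, alpha)
    ]


# Usage: one soft cloud centered on a dark 5x5 scene.
scene = [[0.0] * 5 for _ in range(5)]
mask = cloud_alpha(5, 5, [(2, 2, 1.0, 0.8)])
clouded = overlay_haze(scene, mask)
```

Because only the blob parameters change, the perturbation stays confined to the input image, which matches the threat model in which the retriever, generator, and knowledge base remain fixed.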
Entities
Institutions
- arXiv