ARTFEED — Contemporary Art Intelligence

New Research Proposes Hierarchical Framework to Strengthen Vision-Language Models Against Adversarial Attacks

ai-technology · 2026-04-22

A new adversarial fine-tuning framework has been developed to improve the resilience of Vision-Language Models (VLMs) against targeted threats. The method exploits the natural hierarchical organization of class spaces, addressing a weakness in which robustness degrades when adversarial attacks target superclasses, such as 'mammal', as well as their specific leaf classes, such as 'cat'. Current robust fine-tuning strategies typically align fixed text embeddings with image embeddings, which can undermine both clean performance and robustness.

The proposed framework instead introduces hierarchical embeddings and establishes multiple levels of adversarially robust alignment between the text and image modalities, together with mechanisms that place visual embeddings at a desired depth within the hierarchy. The authors also establish a theoretical link between an embedding's depth in the hierarchy and the maximum feasible margin size. The research is presented in the paper 'Hierarchically Robust Zero-shot Vision-language Models' (arXiv:2604.18867v1), announced as an arXiv cross-listing, and targets zero-shot classification in VLMs, a capability that remains vulnerable to adversarial attacks despite these models' sophistication.
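To make the mechanism concrete, the following is a minimal PyTorch sketch of what multi-level adversarially robust alignment could look like for a CLIP-style encoder pair. It is a sketch under stated assumptions, not the paper's implementation: the pgd_attack helper, the perturbation budget, the per-level weights, and the CLIP-style temperature of 100 are all illustrative choices.

import torch
import torch.nn.functional as F

def pgd_attack(image_encoder, images, text_embeds, labels,
               eps=4/255, alpha=1/255, steps=10):
    # Untargeted PGD in pixel space against one hierarchy level's
    # text embeddings (an illustrative attack, not the paper's).
    adv = images.clone().detach()
    adv = (adv + torch.empty_like(adv).uniform_(-eps, eps)).clamp(0, 1)
    for _ in range(steps):
        adv.requires_grad_(True)
        img_embeds = F.normalize(image_encoder(adv), dim=-1)
        loss = F.cross_entropy(img_embeds @ text_embeds.t(), labels)
        grad = torch.autograd.grad(loss, adv)[0]
        with torch.no_grad():
            adv = adv + alpha * grad.sign()                 # ascent step
            adv = images + (adv - images).clamp(-eps, eps)  # project to budget
            adv = adv.clamp(0, 1).detach()
    return adv

def hierarchical_robust_loss(image_encoder, images,
                             level_text_embeds, level_labels, level_weights):
    # Sum adversarial alignment losses over hierarchy levels, e.g.
    # level 0 = superclasses ('mammal'), level 1 = leaf classes ('cat').
    total = 0.0
    for t_embeds, labels, w in zip(level_text_embeds, level_labels,
                                   level_weights):
        adv = pgd_attack(image_encoder, images, t_embeds, labels)
        img_embeds = F.normalize(image_encoder(adv), dim=-1)
        logits = 100.0 * img_embeds @ t_embeds.t()  # CLIP-style temperature
        total = total + w * F.cross_entropy(logits, labels)
    return total

Minimizing a loss of this shape would fine-tune the image encoder so that adversarial images stay aligned with the correct text embedding at every level of the hierarchy, rather than at the leaf level alone.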

Key facts

  • Vision-Language Models (VLMs) can perform zero-shot classification but are vulnerable to adversarial attacks.
  • Existing robust fine-tuning methods align fixed text embeddings with image embeddings, sacrificing both clean (natural) performance and robustness.
  • Robustness degrades further when adversarial attacks target superclasses (parent classes such as 'mammal') in addition to base leaf classes (such as 'cat').
  • The proposed framework is based on hierarchical embeddings and several levels of adversarially robust alignment of image-text modalities.
  • Additional mechanisms place visual embeddings at the desired depth of the hierarchy.
  • A theoretical connection is established between the depth of an embedding in the hierarchy and the maximum feasible margin size (see the sketch after this list).
  • The research is detailed in the paper 'Hierarchically Robust Zero-shot Vision-language Models' with identifier arXiv:2604.18867v1.
  • The paper's arXiv announcement type is 'cross', meaning it is cross-listed to additional arXiv categories.
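The depth-to-margin connection flagged above has a simple geometric intuition: the deeper the hierarchy level, the more (and finer) the classes it contains, so the less angular room is left between their prototype embeddings on the unit sphere, and the smaller the maximum feasible margin becomes. The diagnostic below is a hypothetical illustration of that intuition, assuming unit-normalized prototypes; it is not the paper's theorem, and level_margin and the random prototypes are invented for this sketch.

import torch
import torch.nn.functional as F

def level_margin(text_embeds: torch.Tensor) -> float:
    # Half the smallest pairwise angular gap between class prototypes
    # at one hierarchy level; more classes generally shrink it.
    t = F.normalize(text_embeds, dim=-1)
    sims = t @ t.t()
    sims.fill_diagonal_(-1.0)                 # ignore self-similarity
    closest = sims.max().clamp(-1.0, 1.0)     # most-confusable pair
    return 0.5 * torch.acos(closest).item()   # margin in radians

# A coarse level with 3 superclasses vs. a fine level with 30 leaves:
coarse, fine = torch.randn(3, 512), torch.randn(30, 512)
print(level_margin(coarse), level_margin(fine))  # coarse margin is typically larger

Even with random prototypes, the 30-class level typically leaves noticeably less margin than the 3-class level, which is the informal content of the depth-margin trade-off the paper formalizes.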
