SwordBench: New Benchmark for Steering Vision Model Representations
Researchers have introduced SwordBench, a benchmark for evaluating the steering of image representations in vision models. The work addresses a gap in existing evaluation protocols, which have been limited to ambiguous language modeling tasks. SwordBench assesses steering across multiple model backbones and concept removal tasks, and it introduces new evaluation metrics. Cross-concept robustness measures how stable concept detection remains when inputs are orthogonalized against alternative concepts. Collateral damage quantifies unintended effects on downstream task performance for inputs that lack the bias. The abstract reports that a linear support vector machine (SVM) exhibits superior separability but does not give the full results. The paper is available on arXiv under the identifier 2605.16372.
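To make the idea of cross-concept robustness concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the abstract gives no formula, so the score used here (the fraction of inputs whose detection result is unchanged after orthogonalization) is an assumption, as are the function names, the linear probe used as the concept detector, and the toy 2-D vectors.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    """Scale v to unit length."""
    norm = math.sqrt(dot(v, v))
    if norm == 0:
        raise ValueError("cannot normalize the zero vector")
    return [a / norm for a in v]

def orthogonalize(x, direction):
    """Remove from x its component along the given concept direction."""
    d = normalize(direction)
    coeff = dot(x, d)
    return [a - coeff * b for a, b in zip(x, d)]

def detect(x, probe_w, probe_b=0.0):
    """Hypothetical linear concept detector: True if the probe fires."""
    return dot(x, probe_w) + probe_b > 0

def cross_concept_robustness(inputs, probe_w, alt_direction, probe_b=0.0):
    """Assumed score: share of inputs whose detection result is unchanged
    after orthogonalizing them against an alternative concept direction."""
    unchanged = 0
    for x in inputs:
        before = detect(x, probe_w, probe_b)
        after = detect(orthogonalize(x, alt_direction), probe_w, probe_b)
        unchanged += before == after
    return unchanged / len(inputs)

if __name__ == "__main__":
    # Toy 2-D "representations"; real inputs would be backbone embeddings.
    toy_inputs = [[2.0, 0.0], [1.0, -3.0], [1.0, 3.0], [-1.0, 0.0]]
    target_probe = [1.0, 0.0]
    # Alternative concept orthogonal to the probe: detection cannot change.
    print(cross_concept_robustness(toy_inputs, target_probe, [0.0, 1.0]))
    # Alternative concept entangled with the probe: some detections flip.
    print(cross_concept_robustness(toy_inputs, target_probe, [1.0, 1.0]))
```

In this toy setup, a score near 1 means the detector for the target concept is largely unaffected by removing the alternative concept, while a lower score suggests the two concept directions are entangled.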
Key facts
- SwordBench is a benchmark for steering image representations in vision models.
- It evaluates steering across multiple backbones and concept removal tasks.
- New metrics include cross-concept robustness and collateral damage.
- Cross-concept robustness measures stability of concept detection when inputs are orthogonalized against alternative concepts.
- Collateral damage quantifies unintended effects on downstream task performance for inputs lacking the bias.
- A linear support vector machine (SVM) shows superior separability in the experiments.
- The paper is on arXiv with ID 2605.16372.
- Existing protocols were limited to language modeling tasks.
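The collateral damage metric listed above can be illustrated with a small sketch. This is one assumed reading, not the paper's definition: damage is taken as the drop in downstream accuracy on bias-free inputs once steering is applied. The downstream classifier, the steering function (projecting out a single bias direction), and the toy data are all hypothetical.

```python
import math

def remove_direction(x, direction):
    """Toy steering step: project x onto the complement of a bias direction."""
    norm = math.sqrt(sum(d * d for d in direction))
    unit = [d / norm for d in direction]
    coeff = sum(a * b for a, b in zip(x, unit))
    return [a - coeff * b for a, b in zip(x, unit)]

def accuracy(predict, inputs, labels):
    """Fraction of inputs for which predict(x) matches the label."""
    correct = sum(predict(x) == y for x, y in zip(inputs, labels))
    return correct / len(inputs)

def collateral_damage(predict, steer, clean_inputs, labels):
    """Assumed score: accuracy lost on bias-free inputs after steering."""
    before = accuracy(predict, clean_inputs, labels)
    after = accuracy(predict, [steer(x) for x in clean_inputs], labels)
    return before - after

def downstream_predict(x):
    """Hypothetical downstream classifier that partly relies on x[1]."""
    return 1 if x[0] + 0.5 * x[1] > 0 else 0

def steer(x):
    """Steering that removes the (hypothetical) bias direction [0, 1]."""
    return remove_direction(x, [0.0, 1.0])

if __name__ == "__main__":
    # Inputs assumed to lack the bias; steering still perturbs them.
    clean_inputs = [[1.0, 0.0], [-1.0, 0.0], [0.2, -1.0], [-0.2, 1.0]]
    labels = [1, 0, 0, 1]
    print(collateral_damage(downstream_predict, steer, clean_inputs, labels))
```

A value of 0 would mean steering left downstream performance on these inputs untouched; larger values indicate more unintended harm.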
Entities
Institutions
- arXiv