Circle-RoPE: New Positional Embedding for Vision-Language Models
A new research paper presents Circle-RoPE, a positional embedding method designed for large vision-language models (VLMs). The study targets a drawback of Rotary Position Embedding (RoPE): it couples text and image position indices, which introduces an unintended cross-modal relative-position bias. The researchers introduce Per-Token Distance (PTD) to measure this bias and show that geometric attention bias is eliminated when PTD = 0. Circle-RoPE remaps 2D image-token coordinates onto an annulus perpendicular to the text position axis. The result is a cone-like structure in which each text token is equidistant from all image tokens, while the spatial organization within the image is retained. In addition, Alternating Geometry Encoding (AGE) combines Circle-RoPE with standard RoPE across different layers. The paper is available on arXiv under ID 2505.16416.
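The equidistance property of the cone-like layout can be checked with a small geometric sketch. This is an illustration only, not the paper's implementation: the function names, the radius, the even angular spacing, the row-major flattening of the image grid, and the use of a single circle in place of an annulus are all assumptions made here. The `spread` value is a simple max-minus-min distance check, not the paper's PTD definition.

```python
import math


def circle_positions(height, width, radius=1.0):
    """Place height*width image tokens evenly on a circle in the (x, y) plane.

    Hypothetical helper: tokens are taken in row-major order and spaced at
    equal angles. The real mapping in the paper may differ.
    """
    n = height * width
    positions = []
    for k in range(n):
        theta = 2.0 * math.pi * k / n
        positions.append((radius * math.cos(theta), radius * math.sin(theta), 0.0))
    return positions


def text_positions(num_tokens, start=1.0, step=1.0):
    """Place text tokens on the z axis, which passes through the circle's center
    and is perpendicular to the plane of the circle."""
    return [(0.0, 0.0, start + step * t) for t in range(num_tokens)]


image = circle_positions(2, 3)   # a 2x3 grid of image tokens
text = text_positions(4)         # four text tokens

# For each text token, compare its distance to every image token.
spreads = []
for t in text:
    dists = [math.dist(t, p) for p in image]
    spreads.append(max(dists) - min(dists))

print(spreads)
```

Under these assumptions every text token at height z sits at distance sqrt(radius² + z²) from each image token, so each spread is zero up to floating-point error. The distance still varies from one text token to the next, which is what gives the layout its cone shape.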
Key facts
- Circle-RoPE is a new positional embedding for VLMs.
- It addresses cross-modal relative-position bias in RoPE.
- Per-Token Distance (PTD) quantifies positional disentanglement.
- PTD = 0 is sufficient to eliminate geometric attention bias.
- Circle-RoPE remaps 2D image tokens onto an annulus.
- Alternating Geometry Encoding (AGE) combines Circle-RoPE and RoPE.
- The paper is on arXiv: 2505.16416.
- The source does not name the paper's authors.
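The layer-wise combination described for AGE can be pictured as a per-layer schedule. The sketch below is a minimal assumption: the source only says that Circle-RoPE and standard RoPE are combined across different layers, so the strict even/odd alternation, the function name `age_schedule`, and the labels `"circle_rope"` and `"rope"` are hypothetical.

```python
def age_schedule(num_layers, circle_first=True):
    """Return, for each layer, which positional geometry that layer would use.

    Hypothetical helper: assumes a strict one-by-one alternation, which the
    source does not specify.
    """
    kinds = ("circle_rope", "rope") if circle_first else ("rope", "circle_rope")
    return [kinds[i % 2] for i in range(num_layers)]


schedule = age_schedule(6)
print(schedule)
```

A model following such a schedule would select the position-index geometry for each transformer layer by looking up that layer's entry.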
Entities
Institutions
- arXiv