ARTFEED — Contemporary Art Intelligence

Circle-RoPE: New Positional Embedding for Vision-Language Models

ai-technology · 2026-05-23

A new research paper presents Circle-RoPE, a positional embedding method for large vision-language models (VLMs). The work targets a drawback of Rotary Position Embedding (RoPE): it couples text and image position indices, which introduces an unintended cross-modal relative-position bias. The researchers define Per-Token Distance (PTD) to measure this bias and show that PTD = 0 is sufficient to eliminate geometric attention bias. Circle-RoPE remaps 2D image-token coordinates onto an annulus perpendicular to the text position axis. The result is a cone-like structure in which each text token is equidistant from all image tokens, while the spatial organization within each image is preserved. In addition, Alternating Geometry Encoding (AGE) combines Circle-RoPE with standard RoPE across different layers. The paper is available on arXiv under ID 2505.16416.
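The cone geometry can be illustrated with a small sketch. This is not the paper's implementation: it simplifies the annulus to a single circle of fixed radius, places image tokens on it in row-major order, and puts text tokens on the z axis. The function names (`circle_positions`, `text_position`, `distance_spread`), the radius, and the angle assignment are all assumptions made for illustration. The spread computed here is a plain Euclidean check, not the paper's PTD definition.

```python
import math


def circle_positions(height, width, radius=1.0):
    """Place height*width image tokens on a circle in the plane z = 0.

    Assumption: tokens are laid out in row-major order at evenly spaced
    angles. The paper's actual coordinate mapping may differ.
    """
    n = height * width
    positions = []
    for k in range(n):
        theta = 2.0 * math.pi * k / n
        positions.append((radius * math.cos(theta), radius * math.sin(theta), 0.0))
    return positions


def text_position(t):
    """Text tokens sit on the z axis, perpendicular to the image circle."""
    return (0.0, 0.0, float(t))


def distances_to_image(t, image_positions):
    """Euclidean distance from text token t to every image token."""
    origin = text_position(t)
    return [math.dist(origin, p) for p in image_positions]


def distance_spread(t, image_positions):
    """Max minus min distance; zero means the text token is equidistant."""
    d = distances_to_image(t, image_positions)
    return max(d) - min(d)


if __name__ == "__main__":
    image = circle_positions(height=3, width=4, radius=2.0)
    for t in (1, 5, 20):
        print(t, distance_spread(t, image))
```

Under these assumptions every text token at height t lies at distance sqrt(radius² + t²) from each image token, so the spread is zero up to floating-point error regardless of t. The text axis and the image circle together trace the cone-like shape described above.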

Key facts

  • Circle-RoPE is a new positional embedding for VLMs.
  • It addresses cross-modal relative-position bias in RoPE.
  • Per-Token Distance (PTD) quantifies positional disentanglement.
  • PTD = 0 is sufficient to eliminate geometric attention bias.
  • Circle-RoPE remaps 2D image tokens onto an annulus.
  • Alternating Geometry Encoding (AGE) combines Circle-RoPE and RoPE.
  • The paper is on arXiv: 2505.16416.
  • The source does not name the paper's authors.

Entities

Institutions

  • arXiv

Sources