StyleVAR: Visual Autoregressive Model for Controllable Style Transfer
StyleVAR is an approach to image style transfer built on the Visual Autoregressive Modeling (VAR) framework. It formulates style transfer as conditional discrete sequence modeling in a learned latent space: images are decomposed into multi-scale representations and tokenized into discrete codes by a VQ-VAE, and a transformer then autoregressively predicts the distribution of target tokens conditioned on style and content tokens. A blended cross-attention mechanism injects style and content information, letting the evolving target representation attend to its own history while style and content features guide which parts of that history to emphasize. A scale-dependent blending coefficient regulates the influence of style and content at each scale, so the synthesized output preserves both content structure and style texture without disrupting VAR's autoregressive continuity. The model is trained in two stages. The paper is available on arXiv with identifier 2604.21052.
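The blended cross-attention described above can be sketched as follows. The exact formulation is not reproduced in this summary, so the additive blend of self-, style-, and content-attention outputs weighted by a coefficient `alpha` is an illustrative assumption, not the paper's actual equation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # standard scaled dot-product attention: q is (T, d), k and v are (S, d)
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d)) @ v

def blended_cross_attention(target, style, content, alpha):
    """Hypothetical blend: the target sequence attends to its own history
    (self-attention) and to style/content features (cross-attention),
    with a coefficient alpha in [0, 1] trading style off against content."""
    self_out = attention(target, target, target)
    style_out = attention(target, style, style)
    content_out = attention(target, content, content)
    return self_out + alpha * style_out + (1 - alpha) * content_out
```

In the paper, `alpha` is scale-dependent rather than fixed, so early (coarse) and late (fine) scales can weight structure and texture differently.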
Key facts
- StyleVAR builds on the Visual Autoregressive Modeling (VAR) framework.
- Style transfer is formulated as conditional discrete sequence modeling in a learned latent space.
- Images are decomposed into multi-scale representations and tokenized by a VQ-VAE.
- A transformer autoregressively models target tokens conditioned on style and content tokens.
- A blended cross-attention mechanism is introduced for style and content injection.
- A scale-dependent blending coefficient controls style and content influence at each stage.
- The model is trained in two stages.
- The paper is available on arXiv with identifier 2604.21052.
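The coarse-to-fine generation implied by the facts above can be sketched as below. The scale schedule, codebook size, linear coefficient schedule, and the stand-in predictor are all illustrative assumptions; a real VAR-style model would run a transformer conditioned on the token history plus the style and content tokens.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 256  # hypothetical VQ-VAE codebook size

def predict_scale(history, style_tokens, content_tokens, side, alpha):
    """Stand-in for the conditional transformer. A real model would attend
    over the multi-scale history and the style/content tokens, blended with
    coefficient alpha; here we just sample a side x side map of token ids."""
    return rng.integers(0, VOCAB, size=(side, side))

def generate(style_tokens, content_tokens, scales=(1, 2, 4, 8)):
    """Autoregress over scales: each token map is predicted conditioned on
    all coarser maps plus the style and content tokens."""
    history = []
    for k, side in enumerate(scales):
        # hypothetical scale-dependent blending coefficient (linear schedule)
        alpha = k / max(len(scales) - 1, 1)
        history.append(predict_scale(history, style_tokens,
                                     content_tokens, side, alpha))
    return history
```

The two-stage training mentioned above would fit this split naturally: the VQ-VAE tokenizer is learned first, then the transformer is trained over its discrete codes.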
Entities
Institutions
- arXiv