Hi-SAM: Hierarchical Framework for Multi-modal Recommendation
Hi-SAM (Hierarchical Structure-Aware Multi-modal) is a framework for multi-modal recommendation that targets two weaknesses of semantic-ID methods. First, tokenizers such as RQ-VAE produce suboptimal tokens because shared cross-modal semantics are entangled with modality-specific details, which leads to redundancy or collapse. Second, vanilla Transformers treat semantic IDs as flat sequences and ignore the hierarchy of user interactions, items, and tokens, which biases attention toward local details. To address the tokenization problem, Hi-SAM introduces a Disentangled Semantic Tokenizer (DST) that unifies modalities through geometry-aware alignment and quantizes them with a coarse-to-fine strategy using shared codebooks. The framework is described in a paper on arXiv (2602.11799).
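To make the idea of a semantic ID concrete, the following is a minimal sketch of generic residual quantization, the coarse-to-fine scheme that RQ-VAE-style tokenizers are built on. It is not Hi-SAM's DST: the random codebooks, the sizes (DIM, NUM_CODES, LEVELS), and the helper names are all assumptions chosen for illustration, and a real tokenizer would learn its codebooks rather than sample them.

```python
import random


def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to `vector` (squared L2)."""
    best_index, best_dist = 0, float("inf")
    for index, code in enumerate(codebook):
        dist = sum((v - c) ** 2 for v, c in zip(vector, code))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def residual_quantize(embedding, codebooks):
    """Quantize level by level: each level encodes what the previous levels left over."""
    residual = list(embedding)
    reconstruction = [0.0] * len(embedding)
    codes = []
    for codebook in codebooks:
        index = nearest_code(residual, codebook)
        code = codebook[index]
        codes.append(index)
        reconstruction = [r + c for r, c in zip(reconstruction, code)]
        residual = [r - c for r, c in zip(residual, code)]
    return codes, reconstruction


def make_codebook(rng, num_codes, dim, scale):
    """Hypothetical stand-in for a learned codebook: random Gaussian entries."""
    return [[rng.gauss(0.0, scale) for _ in range(dim)] for _ in range(num_codes)]


rng = random.Random(0)
DIM, NUM_CODES, LEVELS = 8, 16, 3  # illustrative sizes, not taken from the paper

# Later levels use smaller-scale codes, mimicking coarse-to-fine refinement.
codebooks = [make_codebook(rng, NUM_CODES, DIM, 1.0 / (2 ** level)) for level in range(LEVELS)]

item_embedding = [rng.gauss(0.0, 1.0) for _ in range(DIM)]
semantic_id, reconstruction = residual_quantize(item_embedding, codebooks)
print(semantic_id)  # one code index per level
```

The list of per-level indices plays the role of the item's semantic ID: the first index is the coarsest description and each later index refines it.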
Key facts
- Hi-SAM stands for Hierarchical Structure-Aware Multi-modal framework.
- It addresses suboptimal tokenization in existing methods like RQ-VAE.
- Existing methods lack disentanglement between cross-modal semantics and modality-specific details.
- Vanilla Transformers ignore the hierarchy of user interactions, items, and tokens.
- Hi-SAM uses a Disentangled Semantic Tokenizer (DST).
- DST unifies modalities via geometry-aware alignment.
- Quantization uses a coarse-to-fine strategy with shared codebooks.
- The paper is available on arXiv with ID 2602.11799.
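The point about flat sequences can be illustrated with a small sketch. It flattens a user's interaction history, where each item is a tuple of semantic-ID codes, into the single token stream a vanilla Transformer would see, and also records the item position and code level of each token. The history values and the function name are hypothetical, and the source does not say how Hi-SAM itself encodes this hierarchy; the sketch only shows what structure a flat view discards.

```python
def flatten_history(history):
    """Flatten per-item semantic IDs into one token list, tracking structure alongside."""
    tokens, item_positions, levels = [], [], []
    for item_position, semantic_id in enumerate(history):
        for level, code in enumerate(semantic_id):
            tokens.append(code)                    # what a flat sequence model sees
            item_positions.append(item_position)   # which interaction the token belongs to
            levels.append(level)                   # coarse (0) to fine position within the item
    return tokens, item_positions, levels


# Hypothetical history: three items, each with a three-level semantic ID.
history = [(3, 7, 1), (3, 2, 9), (5, 0, 4)]
tokens, item_positions, levels = flatten_history(history)

print(tokens)          # [3, 7, 1, 3, 2, 9, 5, 0, 4]
print(item_positions)  # [0, 0, 0, 1, 1, 1, 2, 2, 2]
print(levels)          # [0, 1, 2, 0, 1, 2, 0, 1, 2]
```

In the token list alone, nothing distinguishes the boundary between two items from the boundary between two levels of the same item; the two index lists carry exactly the information that is lost.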
Entities
Institutions
- arXiv