HILBERT Framework Introduces Dual Contrastive Alignment for Audio-Text Representation Learning
HILBERT (HIerarchical Long-sequence Balanced Embedding with Reciprocal contrastive Training) is a cross-attentive multimodal framework for learning document-level audio-text representations from long, segmented sequences when training data is scarce. It uses frozen, pre-trained speech and language encoders to extract segment-level features, then aggregates them with cross-modal attention and self-attentive pooling into modality-specific document representations and a joint cross-attentive embedding. A reciprocal dual contrastive objective aligns the audio-to-joint and text-to-joint representations simultaneously, and two auxiliary regularizers, a Centered Kernel Alignment (CKA) loss and an information-balanced regularization term, stabilize long-sequence fusion. The work, posted as arXiv:2604.16247v1, targets the severe dimensional imbalance between audio and text while preserving modality-specific structure. Because both encoders stay frozen, neither needs to be retrained.
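To make the aggregation step concrete, the following is a minimal NumPy sketch of self-attentive pooling and cross-modal attention over segment-level features. It is not the authors' implementation: the shared feature width `d`, the single scoring vector per pooling layer, the unparameterized scaled dot-product attention, and all function and variable names are assumptions made for illustration. Random arrays stand in for the outputs of the frozen encoders.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attentive_pool(segments, w):
    """Pool (n_segments, d) features into one (d,) vector.

    `w` is a hypothetical learned scoring vector of shape (d,).
    """
    weights = softmax(segments @ w)   # one attention weight per segment
    return weights @ segments         # weighted average of the segments

def cross_attend(queries, keys_values):
    """Scaled dot-product attention from one modality to the other."""
    d = queries.shape[-1]
    attn = softmax(queries @ keys_values.T / np.sqrt(d), axis=-1)
    return attn @ keys_values

rng = np.random.default_rng(0)
d = 16  # assumed shared width after projecting both modalities
audio_segments = rng.normal(size=(12, d))  # stand-in for frozen speech-encoder output
text_segments = rng.normal(size=(5, d))    # stand-in for frozen text-encoder output
w_audio, w_text, w_joint = (rng.normal(size=d) for _ in range(3))

# Modality-specific document representations.
audio_doc = self_attentive_pool(audio_segments, w_audio)
text_doc = self_attentive_pool(text_segments, w_text)

# Joint embedding: each modality attends to the other, then everything is pooled.
audio_ctx = cross_attend(audio_segments, text_segments)
text_ctx = cross_attend(text_segments, audio_segments)
joint_doc = self_attentive_pool(np.concatenate([audio_ctx, text_ctx]), w_joint)
```

In this sketch the audio sequence has more segments than the text sequence, which is one simple way the imbalance between modalities can show up. Pooling reduces each sequence to a fixed-size vector regardless of its length.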
Key facts
- HILBERT (HIerarchical Long-sequence Balanced Embedding with Reciprocal contrastive Training) is a cross-attentive multimodal framework
- It learns document-level audio-text representations from long, segmented sequences in low-resource data settings
- The framework leverages frozen pre-trained speech and language encoders to extract segment-level features
- Features are aggregated via cross-modal attention and self-attentive pooling to form modality-specific document representations and a joint cross-attentive embedding
- A reciprocal dual contrastive objective aligns audio-to-joint and text-to-joint representations simultaneously
- Two auxiliary regularizers stabilize long-sequence fusion: a Centered Kernel Alignment (CKA) loss and information-balanced regularization
- The paper is listed on arXiv as arXiv:2604.16247v1, with announcement type "cross" (a cross-listing)
- The framework addresses severe audio-text dimensional imbalance while preserving modality-specific structure
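The reciprocal contrastive objective and the CKA regularizer listed above can be sketched as follows. This is a hedged illustration, not the paper's loss: it assumes a symmetric InfoNCE form for each of the audio-to-joint and text-to-joint terms, a linear-kernel CKA between the audio and text document embeddings, a penalty of `1 - CKA`, and hypothetical values for `temperature` and `cka_weight`. The information-balanced regularizer is omitted because the summary does not define it.

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def log_softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def info_nce(a, b, temperature=0.1):
    """Symmetric InfoNCE: row i of `a` and row i of `b` form the positive pair."""
    logits = l2_normalize(a) @ l2_normalize(b).T / temperature
    a_to_b = -np.mean(np.diag(log_softmax(logits, axis=1)))
    b_to_a = -np.mean(np.diag(log_softmax(logits, axis=0)))
    return 0.5 * (a_to_b + b_to_a)

def linear_cka(x, y):
    """Linear CKA between (n, d1) and (n, d2) representations of the same n items."""
    x = x - x.mean(axis=0)
    y = y - y.mean(axis=0)
    cross = np.linalg.norm(y.T @ x, "fro") ** 2
    return cross / (np.linalg.norm(x.T @ x, "fro") * np.linalg.norm(y.T @ y, "fro"))

def dual_contrastive_loss(audio, text, joint, cka_weight=0.1):
    # Align each modality with the joint embedding (assumed form of the dual objective).
    contrastive = info_nce(audio, joint) + info_nce(text, joint)
    # Assumed regularizer: penalize low audio-text similarity structure.
    cka_penalty = 1.0 - linear_cka(audio, text)
    return contrastive + cka_weight * cka_penalty

rng = np.random.default_rng(0)
audio = rng.normal(size=(8, 16))   # batch of 8 document embeddings per modality
text = rng.normal(size=(8, 16))
joint = 0.5 * (audio + text)       # toy joint embedding for demonstration only
loss = dual_contrastive_loss(audio, text, joint)
```

Linear CKA compares representations through their feature cross-covariance, so the two inputs may have different feature widths as long as they describe the same batch of documents.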
Entities
Institutions
- arXiv