SFFL: Reducing Cross-Modal Interference in Audio-Visual LLMs
Separate First, Fuse Later (SFFL) is a recently introduced framework that reduces cross-modal interference in audio-visual large language models (LLMs). It enforces modality-specific chain-of-thought reasoning, producing separate reasoning traces for the audio and visual inputs before fusing them to generate a response; this separation targets hallucinations caused by one modality's signal interfering with the other. Modality-preference labels, constructed via a data pipeline tailored to different modality input configurations, serve as an auxiliary reward during reinforcement learning, encouraging an instance-specific preference for the more reliable modality's cues. The paper is available on arXiv under identifier 2605.09906.
Key facts
- SFFL stands for Separate First, Fuse Later
- Framework reduces cross-modal interference in audio-visual LLMs
- Enforces modality-specific chain-of-thought reasoning
- Produces separate audio and visual reasoning traces
- Modality-preference labels are constructed via a data pipeline
- Labels used as auxiliary reward in reinforcement learning
- Paper available on arXiv: 2605.09906
- Addresses hallucinations caused by cross-modal interference
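The two-stage generation and the auxiliary reward described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: all names (`reason_over`, `fuse`, `sffl_reward`, the weight `beta`) are hypothetical placeholders, and the real method operates on model rollouts rather than strings.

```python
def reason_over(modality: str, features: str) -> str:
    """Toy stand-in for a modality-specific chain-of-thought pass."""
    return f"[{modality} reasoning over: {features}]"

def fuse(audio_trace: str, visual_trace: str) -> str:
    """Toy stand-in for late fusion of the two reasoning traces."""
    return f"answer from ({audio_trace} + {visual_trace})"

def sffl_generate(audio_feats: str, visual_feats: str):
    # Separate first: independent reasoning per modality,
    # so neither input can interfere with the other's trace.
    audio_trace = reason_over("audio", audio_feats)
    visual_trace = reason_over("visual", visual_feats)
    # Fuse later: merge the traces only when producing the response.
    return audio_trace, visual_trace, fuse(audio_trace, visual_trace)

def sffl_reward(correct: bool, preferred: str, relied_on: str,
                beta: float = 0.5) -> float:
    """Task reward plus an auxiliary term that pays out when the model
    relied on the modality named by the instance's preference label.
    beta is a hypothetical weighting, not a value from the paper."""
    task_r = 1.0 if correct else 0.0
    pref_r = 1.0 if relied_on == preferred else 0.0
    return task_r + beta * pref_r
```

Under this sketch, a correct answer that also follows the preference label scores highest (`1.0 + beta`), a correct answer that ignores it scores `1.0`, and an incorrect answer that at least relied on the labeled modality still earns the auxiliary `beta`, which is what shapes the instance-specific modality preference during RL.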
Entities
Institutions
- arXiv