SFFL: Reducing Cross-Modal Interference in Audio-Visual LLMs
Separate First, Fuse Later (SFFL) is a recently introduced framework that reduces cross-modal interference in audio-visual large language models (LLMs). It enforces modality-specific chain-of-thought reasoning, producing separate reasoning traces for the audio and visual inputs before fusing them to generate a response; this separation targets hallucinations caused by one modality's signal interfering with the other. Modality-preference labels, constructed via a data pipeline tailored to different modality input configurations, serve as an auxiliary reward during reinforcement learning, encouraging an instance-specific preference for the more reliable modality's cues. The paper is available on arXiv under identifier 2605.09906.
Key facts
- SFFL stands for Separate First, Fuse Later
- Framework reduces cross-modal interference in audio-visual LLMs
- Enforces modality-specific chain-of-thought reasoning
- Produces separate audio and visual reasoning traces
- Modality-preference labels are constructed via a data pipeline
- Labels used as auxiliary reward in reinforcement learning
- Paper available on arXiv: 2605.09906
- Addresses hallucinations caused by cross-modal interference
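The two-stage generation and the auxiliary reward described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: all names (`reason_over`, `fuse`, `sffl_reward`, the weight `beta`) are hypothetical placeholders, and the real method operates on model rollouts rather than strings.

```python
def reason_over(modality: str, features: str) -> str:
    """Toy stand-in for a modality-specific chain-of-thought pass."""
    return f"[{modality} reasoning over: {features}]"

def fuse(audio_trace: str, visual_trace: str) -> str:
    """Toy stand-in for late fusion of the two reasoning traces."""
    return f"answer from ({audio_trace} + {visual_trace})"

def sffl_generate(audio_feats: str, visual_feats: str):
    # Separate first: independent reasoning per modality,
    # so neither input can interfere with the other's trace.
    audio_trace = reason_over("audio", audio_feats)
    visual_trace = reason_over("visual", visual_feats)
    # Fuse later: merge the traces only when producing the response.
    return audio_trace, visual_trace, fuse(audio_trace, visual_trace)

def sffl_reward(correct: bool, preferred: str, relied_on: str,
                beta: float = 0.5) -> float:
    """Task reward plus an auxiliary term that pays out when the model
    relied on the modality named by the instance's preference label.
    beta is a hypothetical weighting, not a value from the paper."""
    task_r = 1.0 if correct else 0.0
    pref_r = 1.0 if relied_on == preferred else 0.0
    return task_r + beta * pref_r
```

Under this sketch, a correct answer that also follows the preference label scores highest (`1.0 + beta`), a correct answer that ignores it scores `1.0`, and an incorrect answer that at least relied on the labeled modality still earns the auxiliary `beta`, which is what shapes the instance-specific modality preference during RL.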
Entities
Institutions
- arXiv