CounterFlow: Counterfactual Video Foley Generation via Two-Phase Sampling

ai-technology · 2026-05-20

Researchers have introduced CounterFlow, a two-phase inference-time sampling scheme for pretrained flow-matching Video&Text-to-Audio (VT2A) models, aimed at counterfactual video foley generation: producing audio whose sound source differs from the one the visuals imply while staying temporally synchronized with the silent video. VT2A models tend to stay anchored to the visually implied sound source, which makes such replacements difficult. Phase 1 builds the video-derived temporal structure while suppressing the visually implied source. Phase 2 drops video conditioning and focuses on shaping the audio's timbre toward the target prompt. CounterFlow is reported to outperform naive negative prompting and state-of-the-art baselines. The authors also propose a new metric, computed in a text-audio co-embedding space, that evaluates replacement quality by measuring both target-prompt evidence and residual source content.
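The two-phase idea can be illustrated with a minimal sketch of Euler sampling over a flow-matching velocity field. Everything named here is an assumption for illustration: the `velocity` callable, the `toy_velocity` stand-in, the `switch` fraction, and the `neg_scale` suppression rule (a simple extrapolation away from the source prompt) are hypothetical and are not taken from the paper, which may suppress the source quite differently.

```python
import numpy as np

def two_phase_sample(velocity, x0, target, source, video,
                     n_steps=20, switch=0.5, neg_scale=1.0):
    """Euler-integrate a velocity field in two phases (illustrative only)."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / n_steps
    switch_step = int(round(switch * n_steps))  # step index where phase 2 begins
    schedule = []
    for i in range(n_steps):
        t = i * dt
        if i < switch_step:
            # Phase 1 (assumed form): keep video conditioning for timing,
            # and push away from the visually implied source prompt.
            v_tgt = velocity(x, t, text=target, video=video)
            v_src = velocity(x, t, text=source, video=video)
            v = v_tgt + neg_scale * (v_tgt - v_src)
            schedule.append("phase1")
        else:
            # Phase 2: drop video conditioning, follow the target prompt only.
            v = velocity(x, t, text=target, video=None)
            schedule.append("phase2")
        x = x + dt * v
    return x, schedule

def toy_velocity(x, t, text, video):
    """Hypothetical stand-in for a pretrained VT2A model: pulls x toward
    the text vector, and also toward the video vector when one is given."""
    pull = text - x
    if video is not None:
        pull = pull + 0.5 * (video - x)
    return pull

target = np.array([1.0, 0.0])   # toy "target prompt" embedding
source = np.array([0.0, 1.0])   # toy "visually implied source" embedding
video = np.array([0.5, 0.5])    # toy video feature
x, schedule = two_phase_sample(toy_velocity, np.zeros(2), target, source, video)
print(schedule.count("phase1"), schedule.count("phase2"), x)
```

With a real model, `velocity` would wrap the pretrained network's forward pass, and the switch point would presumably be tuned so that timing is fixed before the video is dropped.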

Key facts

CounterFlow is a two-phase inference-time sampling scheme for counterfactual video foley generation.
It works with pretrained flow-matching Video&Text-to-Audio (VT2A) models.
Phase 1 builds the video-derived temporal structure while suppressing the visually implied source.
Phase 2 drops video conditioning to focus on shaping the audio timbre toward the target prompt.
CounterFlow outperforms naive negative prompting and state-of-the-art baselines.
A new metric, computed in a text-audio co-embedding space, evaluates replacement quality.
The metric measures both target-prompt evidence and residual source content.
The approach addresses VT2A models' tendency to stay anchored to visually implied sound sources.
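
The metric's two components, target-prompt evidence and residual source content, can be pictured with a small sketch. It assumes the audio and both prompts have already been mapped into a shared embedding space by some text-audio encoder, and it uses plain cosine similarity; the function names, the `margin` field, and the choice of cosine similarity are hypothetical and do not reproduce the paper's actual formula.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def replacement_scores(audio_emb, target_emb, source_emb):
    """Score a generated clip against the target and source prompts.

    All inputs are assumed to be embeddings from one shared text-audio
    space; how they are produced is outside this sketch.
    """
    target_evidence = cosine(audio_emb, target_emb)   # higher is better
    source_residual = cosine(audio_emb, source_emb)   # lower is better
    return {
        "target_evidence": target_evidence,
        "source_residual": source_residual,
        "margin": target_evidence - source_residual,  # illustrative summary
    }

# Toy embeddings: the audio aligns with the target and is orthogonal to the source.
scores = replacement_scores([1.0, 0.0], [1.0, 0.0], [0.0, 1.0])
print(scores)
```

Reporting the two similarities separately keeps a clip that merely adds the target sound on top of the original source from scoring as well as one that truly replaces it.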

Entities

—

Sources

arXiv cs.AI — 2026-05-20