ARTFEED — Contemporary Art Intelligence

Temporal Contrastive Decoding Method Addresses Bias in Audio-Language Models

ai-technology · 2026-04-20

A new decoding method, Temporal Contrastive Decoding (TCD), has been introduced to mitigate a temporal smoothing bias in large audio-language models (LALMs). These integrated models, which process speech, sound, and music, often underuse transient acoustic features in favor of temporally smooth context shaped by language priors, producing less precise outputs. TCD operates at inference time: it generates a temporally blurred version of the input waveform and contrasts the next-token predictions for the blurred view against those for the original input. The contrastive signal drives a token-level logit update restricted to a small candidate set. A self-normalized stability score sets the blur window and update scale, and a step-wise gate activates the update only when the model is uncertain and the prediction depends on the audio. Experiments were conducted on MMAU and AI, and the findings were published on arXiv under identifier 2604.15383v1.
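The core mechanics described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the moving-average blur, the `alpha` contrast weight, the top-k candidate set, and all function names here are assumptions chosen to make the idea concrete.

```python
import numpy as np

def temporal_blur(waveform, window):
    """Moving-average blur that smears transient acoustic features
    (an illustrative stand-in for the paper's temporally blurred view)."""
    kernel = np.ones(window) / window
    return np.convolve(waveform, kernel, mode="same")

def tcd_update(logits_orig, logits_blur, alpha=1.0, k=10):
    """Token-level contrastive logit update restricted to a small
    candidate set: tokens whose score drops under blurring (i.e. they
    depend on transient cues) get boosted; all other logits are kept."""
    adjusted = logits_orig.copy()
    candidates = np.argsort(logits_orig)[-k:]  # small candidate set (top-k)
    adjusted[candidates] = ((1 + alpha) * logits_orig[candidates]
                            - alpha * logits_blur[candidates])
    return adjusted
```

In a real LALM, `logits_orig` and `logits_blur` would come from two forward passes at each decoding step, and `alpha` and the blur window would be set by the stability score rather than fixed by hand.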

Key facts

  • Temporal Contrastive Decoding (TCD) is a training-free method for large audio-language models (LALMs).
  • LALMs generalize across speech, sound, and music but can exhibit a temporal smoothing bias.
  • This bias causes transient acoustic cues to be underutilized in favor of temporally smooth context.
  • TCD mitigates this effect at inference time by contrasting next-token logits from original and blurred views.
  • The method applies a token-level logit update restricted to a small candidate set.
  • A self-normalized stability score sets the blur window and update scale.
  • A step-wise gate based on uncertainty and audio reliance activates the update only when needed.
  • Experiments were conducted on MMAU and AI.
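The step-wise gate in the list above can also be sketched. The article does not specify the gate's exact formulas, so this sketch assumes two common proxies: next-token entropy for uncertainty and the divergence between the original and blurred predictive distributions for audio reliance; the thresholds are placeholders.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D logit vector."""
    e = np.exp(logits - logits.max())
    return e / e.sum()

def gate_active(logits_orig, logits_blur, h_thresh=2.0, d_thresh=0.1):
    """Illustrative step-wise gate: fire the TCD update only when the
    model is uncertain (high predictive entropy) AND the prediction
    actually relies on the audio (original vs. blurred views diverge)."""
    p = softmax(logits_orig)
    q = softmax(logits_blur)
    entropy = -np.sum(p * np.log(p + 1e-12))                    # uncertainty
    divergence = np.sum(p * np.log((p + 1e-12) / (q + 1e-12)))  # audio reliance
    return bool(entropy > h_thresh and divergence > d_thresh)
```

Gating this way keeps the contrastive update from perturbing steps where the model is already confident or where the output is driven by the text context alone.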

Entities

Institutions

  • arXiv

Sources