Tsallis Loss Continuum Bridges RLVR and Density Estimation in Reasoning Models
A recent study posted to arXiv (2604.25907) presents a loss family J_Q built on the Tsallis q-logarithm, interpolating between reinforcement learning from verifiable rewards (RLVR) at q=0 and log-marginal-likelihood over latent trajectories at q=1. The authors show that every member of the family shares the same per-example gradient direction, differing only by a scalar amplification P_θ^{-q} that reweights each example independently of the learning rate. This amplification governs cold-start behavior: under gradient flow, the exploitation pole (q=0) needs Ω(1/p_0) time to escape a small initial success probability p_0, whereas the density-estimation pole (q=1) escapes in Θ(log(1/p_0)) time. Intermediate values of q trade escape speed against memorization of noise, giving a theoretical basis for adapting reasoning models to new tasks with only output-level supervision.
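As a minimal sketch of the interpolation, assuming the standard Tsallis q-logarithm ln_q(x) = (x^{1-q} - 1)/(1-q) (the paper's exact normalization of J_Q may differ), the family and its gradient can be written as:

```latex
% Sketch under the standard Tsallis q-logarithm; the paper's exact
% normalization of J_Q may differ.
\[
  J_q(\theta) = \ln_q P_\theta(y \mid x)
              = \frac{P_\theta(y \mid x)^{1-q} - 1}{1-q},
  \qquad
  \nabla_\theta J_q = P_\theta^{-q}\,\nabla_\theta P_\theta .
\]
% q = 0:   J_0 = P_theta - 1, the expected verifiable reward (RLVR).
% q -> 1:  J_q -> log P_theta, the log-marginal-likelihood.
% The direction nabla_theta P_theta is shared across all q; only the
% scalar amplification P_theta^{-q} changes.
```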
Key facts
- Paper arXiv:2604.25907 proposes loss family J_Q using Tsallis q-logarithm
- Interpolates between RLVR (q=0) and log-marginal-likelihood (q=1)
- All members share the same per-example gradient direction
- Scalar amplification P_θ^{-q} reweights each example independently of the learning rate
- Exploitation pole (q=0) requires Ω(1/p_0) time to escape cold start
- Density-estimation pole (q=1) escapes in Θ(log(1/p_0)) time (see the sketch after this list)
- Intermediate q trades escape speed against noise memorization
- Addresses cold-start stalling when initial success probability p_0 is small
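The escape-time gap can be seen in a toy gradient-flow model. The sketch below is an illustration, not code from the paper: it assumes a single success probability parameterized as p = sigmoid(z), so ∇_z J_q = p^{-q} · p(1-p) = p^{1-q}(1-p), and integrates the flow with Euler steps until p reaches 0.5.

```python
import math

def escape_time(q: float, p0: float = 1e-3, target: float = 0.5,
                dt: float = 1e-2, max_steps: int = 500_000) -> float:
    """Euler-integrated gradient flow on J_q for a toy one-parameter model.

    Assumes p = sigmoid(z), so grad_z J_q = p^{-q} * p(1-p) = p^{1-q}(1-p).
    Returns the continuous time for the success probability to climb
    from p0 to `target`.
    """
    z = math.log(p0 / (1.0 - p0))             # logit of initial success prob
    for step in range(max_steps):
        p = 1.0 / (1.0 + math.exp(-z))
        if p >= target:
            return step * dt                  # escape time in flow units
        z += dt * p ** (1.0 - q) * (1.0 - p)  # ascent step on J_q
    return math.inf

# q = 0 scales like 1/p0; q = 1 scales like log(1/p0); q = 0.5 sits between.
for q in (0.0, 0.5, 1.0):
    print(f"q = {q}: escape time ≈ {escape_time(q):.1f}")
```

With p0 = 1e-3, the q=0 flow takes on the order of 1/p0 ≈ 1000 time units while q=1 needs only log(1/p0) ≈ 7, matching the Ω(1/p_0) versus Θ(log(1/p_0)) rates above.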