Tsallis Loss Continuum Bridges RLVR and Density Estimation in Reasoning Models
A recent study posted to arXiv (2604.25907) presents a loss family J_Q built on the Tsallis q-logarithm, interpolating between reinforcement learning from verifiable rewards (RLVR) at q=0 and log-marginal-likelihood over latent trajectories at q=1. The authors show that every member of the family shares the same per-example gradient direction, differing only by a scalar amplification P_θ^{-q} that reweights each example independently of the learning rate. This amplification governs cold-start behavior: under gradient flow, the exploitation pole (q=0) needs Ω(1/p_0) time to escape a small initial success probability p_0, whereas the density-estimation pole (q=1) escapes in Θ(log(1/p_0)) time. Intermediate values of q trade escape speed against memorization of noise, giving a theoretical basis for adapting reasoning models to new tasks with only output-level supervision.
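As a minimal sketch of the interpolation, assuming the standard Tsallis q-logarithm ln_q(x) = (x^{1-q} - 1)/(1-q) (the paper's exact normalization of J_Q may differ), the family and its gradient can be written as:

```latex
% Sketch under the standard Tsallis q-logarithm; the paper's exact
% normalization of J_Q may differ.
\[
  J_q(\theta) = \ln_q P_\theta(y \mid x)
              = \frac{P_\theta(y \mid x)^{1-q} - 1}{1-q},
  \qquad
  \nabla_\theta J_q = P_\theta^{-q}\,\nabla_\theta P_\theta .
\]
% q = 0:   J_0 = P_theta - 1, the expected verifiable reward (RLVR).
% q -> 1:  J_q -> log P_theta, the log-marginal-likelihood.
% The direction nabla_theta P_theta is shared across all q; only the
% scalar amplification P_theta^{-q} changes.
```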
Key facts
- Paper arXiv:2604.25907 proposes loss family J_Q using Tsallis q-logarithm
- Interpolates between RLVR (q=0) and log-marginal-likelihood (q=1)
- All members share the same per-example gradient direction
- Scalar amplification P_θ^{-q} reweights each example independently of the learning rate
- Exploitation pole (q=0) requires Ω(1/p_0) time to escape cold start
- Density-estimation pole (q=1) escapes in Θ(log(1/p_0)) time (see the sketch after this list)
- Intermediate q trades escape speed against noise memorization
- Addresses cold-start stalling when initial success probability p_0 is small
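The escape-time gap can be seen in a toy gradient-flow model. The sketch below is an illustration, not code from the paper: it assumes a single success probability parameterized as p = sigmoid(z), so ∇_z J_q = p^{-q} · p(1-p) = p^{1-q}(1-p), and integrates the flow with Euler steps until p reaches 0.5.

```python
import math

def escape_time(q: float, p0: float = 1e-3, target: float = 0.5,
                dt: float = 1e-2, max_steps: int = 500_000) -> float:
    """Euler-integrated gradient flow on J_q for a toy one-parameter model.

    Assumes p = sigmoid(z), so grad_z J_q = p^{-q} * p(1-p) = p^{1-q}(1-p).
    Returns the continuous time for the success probability to climb
    from p0 to `target`.
    """
    z = math.log(p0 / (1.0 - p0))             # logit of initial success prob
    for step in range(max_steps):
        p = 1.0 / (1.0 + math.exp(-z))
        if p >= target:
            return step * dt                  # escape time in flow units
        z += dt * p ** (1.0 - q) * (1.0 - p)  # ascent step on J_q
    return math.inf

# q = 0 scales like 1/p0; q = 1 scales like log(1/p0); q = 0.5 sits between.
for q in (0.0, 0.5, 1.0):
    print(f"q = {q}: escape time ≈ {escape_time(q):.1f}")
```

With p0 = 1e-3, the q=0 flow takes on the order of 1/p0 ≈ 1000 time units while q=1 needs only log(1/p0) ≈ 7, matching the Ω(1/p_0) versus Θ(log(1/p_0)) rates above.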