ARTFEED — Contemporary Art Intelligence

Multimodal Energy-Based Model Learning via MCMC Revision

publication · 2026-05-04

A new learning framework for multimodal energy-based models (EBMs) is proposed, addressing the poor mixing of noise-initialized Langevin dynamics in joint data space. The framework integrates a multimodal VAE whose shared latent generator and joint inference model are currently limited by unimodal Gaussian or Laplace parameterization. By interweaving MCMC revision into the joint training of the EBM, the shared latent generator, and the joint inference model, the method better captures complex inter-modal dependencies. The work is published on arXiv under ID 2605.00644.
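Maximum-likelihood learning of an EBM requires samples drawn from the model itself, typically via Langevin dynamics. The following NumPy sketch shows noise-initialized Langevin sampling on a toy quadratic energy; the function names and the energy are illustrative stand-ins, not the paper's actual model:

```python
import numpy as np

def langevin_sample(grad_energy, x0, step=0.1, n_steps=500, rng=None):
    """Noise-initialized Langevin dynamics:
    x_{t+1} = x_t - (step/2) * dE/dx + sqrt(step) * noise."""
    rng = rng or np.random.default_rng(0)
    x = x0.copy()
    for _ in range(n_steps):
        x = x - 0.5 * step * grad_energy(x) \
              + np.sqrt(step) * rng.standard_normal(x.shape)
    return x

def grad_energy(z):
    # Toy energy E(z) = 0.5 * ||z||^2 over a 2-D "joint space";
    # its gradient is z, so samples should approach N(0, I).
    return z

# Start far from the target distribution, mimicking noise initialization.
x0 = np.random.default_rng(1).standard_normal((512, 2)) * 5.0
samples = langevin_sample(grad_energy, x0)
```

On this unimodal toy energy the chain mixes easily; the paper's point is that on a genuinely multimodal joint data space, chains started from noise tend to get stuck and miss coherent inter-modal structure.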

Key facts

  • Energy-based models (EBMs) are a flexible class of deep generative models.
  • Learning a multimodal EBM by maximum likelihood requires MCMC sampling in joint data space.
  • Noise-initialized Langevin dynamics often mixes poorly and fails to discover coherent inter-modal relationships.
  • Multimodal VAEs capture inter-modal dependencies via a shared latent generator and joint inference model.
  • Both the shared latent generator and the joint inference model are parameterized as unimodal Gaussian or Laplace distributions.
  • This unimodal parameterization limits how well they can approximate complex multimodal data structures.
  • A learning framework is presented that interweaves MCMC revision into training.
  • The framework jointly studies the learning of the multimodal EBM, the shared latent generator, and the joint inference model.
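The unimodal parameterization noted above can be made concrete: a joint inference model maps all modalities to a single diagonal Gaussian over the shared latent, sampled with the reparameterization trick. The sketch below is a hypothetical minimal version (the fusion weights and dimensions are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d_img, d_txt, d_z = 8, 4, 2  # illustrative modality / latent dimensions

# Hypothetical linear fusion weights; in practice these are learned networks.
W_img = rng.standard_normal((d_img, d_z))
W_txt = rng.standard_normal((d_txt, d_z))
W_logvar = rng.standard_normal((d_img + d_txt, d_z)) * 0.1

def joint_inference(x_img, x_txt):
    """q(z | x_img, x_txt): one diagonal Gaussian over the shared latent.
    However complex the data, the posterior approximation stays unimodal."""
    mu = x_img @ W_img + x_txt @ W_txt
    logvar = np.concatenate([x_img, x_txt], axis=-1) @ W_logvar
    return mu, logvar

def reparameterize(mu, logvar):
    """z = mu + sigma * eps keeps sampling differentiable in (mu, logvar)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

x_img = rng.standard_normal((3, d_img))
x_txt = rng.standard_normal((3, d_txt))
mu, logvar = joint_inference(x_img, x_txt)
z = reparameterize(mu, logvar)
```

Because q is a single Gaussian regardless of the inputs, it cannot represent multimodal posteriors; this is the limitation the MCMC-revision step is meant to address.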

Entities

Institutions

  • arXiv
