Multimodal Energy-Based Model Learning via MCMC Revision
A new learning framework for multimodal energy-based models (EBMs) is proposed that addresses the poor mixing of noise-initialized Langevin dynamics in the joint data space. The framework couples the EBM with a multimodal VAE, whose shared latent generator and joint inference model are conventionally parameterized as unimodal Gaussian or Laplace distributions, a choice that limits how well complex multimodal data structures can be approximated. By interweaving MCMC revision into the learning of these components, the method better captures complex inter-modal dependencies. The work is published on arXiv under ID 2605.00644.
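The sampling step that maximum-likelihood EBM learning requires can be made concrete with a minimal sketch of noise-initialized Langevin dynamics. The quadratic energy, step size, and chain length below are illustrative assumptions, not details from the paper:

```python
import numpy as np

# Langevin dynamics draws samples from p(x) ∝ exp(-E(x)) by following the
# negative energy gradient with injected Gaussian noise:
#   x_{t+1} = x_t - (step / 2) * ∇E(x_t) + sqrt(step) * ε_t,  ε_t ~ N(0, I)

def energy_grad(x):
    # Toy quadratic energy E(x) = 0.5 * ||x||^2, so ∇E(x) = x.
    # (A learned EBM would supply this gradient via autodiff.)
    return x

def langevin_sample(n_chains, dim, n_steps=200, step=0.1, seed=0):
    rng = np.random.default_rng(seed)
    # Noise initialization: chains start from pure Gaussian noise.
    x = rng.standard_normal((n_chains, dim))
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x = x - 0.5 * step * energy_grad(x) + np.sqrt(step) * noise
    return x

samples = langevin_sample(n_chains=1000, dim=2)
```

For this unimodal toy energy the chains mix easily; the paper's point is that in a high-dimensional joint data space with multiple modes, noise-initialized chains rarely cross energy barriers within a practical number of steps, which is what motivates the revision scheme.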
Key facts
- Energy-based models (EBMs) are a flexible class of deep generative models.
- Learning a multimodal EBM by maximum likelihood requires MCMC sampling in the joint data space.
- Noise-initialized Langevin dynamics often mixes poorly and fails to discover coherent inter-modal relationships.
- Multimodal VAEs capture inter-modal dependencies via a shared latent generator and joint inference model.
- Both the shared latent generator and the joint inference model are parameterized as unimodal Gaussian or Laplace distributions.
- This parameterization limits approximation of complex multimodal data structures.
- The proposed framework interweaves MCMC revision into the joint learning of the multimodal EBM, the shared latent generator, and the joint inference model.
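As a rough illustration of the revision idea (not the paper's implementation), a generator can propose initial samples that short-run Langevin dynamics then refines under the EBM's energy. The linear generator, quadratic energy, and step counts below are assumptions made for the sketch:

```python
import numpy as np

def energy(x):
    # Toy quadratic energy E(x) = 0.5 * ||x||^2 (stand-in for a learned EBM).
    return 0.5 * np.sum(x ** 2, axis=-1)

def generator(z):
    # Toy stand-in for a shared latent generator: maps latent z to data space
    # with an offset, mimicking an imperfect proposal distribution.
    return z + 3.0

def mcmc_revise(x, n_steps=50, step=0.1, rng=None):
    # Short-run Langevin revision of generator proposals under the EBM energy.
    rng = rng if rng is not None else np.random.default_rng(0)
    for _ in range(n_steps):
        grad = x  # ∇E(x) for the quadratic energy above
        x = x - 0.5 * step * grad + np.sqrt(step) * rng.standard_normal(x.shape)
    return x

rng = np.random.default_rng(0)
z = rng.standard_normal((500, 2))
x_init = generator(z)                 # generator proposal
x_rev = mcmc_revise(x_init, rng=rng)  # MCMC-revised samples
```

Because the chains start from generator proposals rather than noise, only a short revision run is needed to move samples toward low-energy regions; the framework interweaves such revision with the learning of the generator and inference model.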
Entities
Institutions
- arXiv