ARTFEED — Contemporary Art Intelligence

Multimodal Energy-Based Model Learning via MCMC Revision

publication · 2026-05-04

A new learning framework for multimodal energy-based models (EBMs) is proposed, addressing the poor mixing of noise-initialized Langevin dynamics in joint data space. The framework integrates a multimodal VAE whose shared latent generator and joint inference model are currently limited by unimodal Gaussian or Laplace parameterization. By interweaving MCMC revision into the joint training of the EBM, the shared latent generator, and the joint inference model, the method better captures complex inter-modal dependencies. The work is published on arXiv under ID 2605.00644.
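Maximum-likelihood learning of an EBM requires samples drawn from the model itself, typically via Langevin dynamics. The following NumPy sketch shows noise-initialized Langevin sampling on a toy quadratic energy; the function names and the energy are illustrative stand-ins, not the paper's actual model:

```python
import numpy as np

def langevin_sample(grad_energy, x0, step=0.1, n_steps=500, rng=None):
    """Noise-initialized Langevin dynamics:
    x_{t+1} = x_t - (step/2) * dE/dx + sqrt(step) * noise."""
    rng = rng or np.random.default_rng(0)
    x = x0.copy()
    for _ in range(n_steps):
        x = x - 0.5 * step * grad_energy(x) \
              + np.sqrt(step) * rng.standard_normal(x.shape)
    return x

def grad_energy(z):
    # Toy energy E(z) = 0.5 * ||z||^2 over a 2-D "joint space";
    # its gradient is z, so samples should approach N(0, I).
    return z

# Start far from the target distribution, mimicking noise initialization.
x0 = np.random.default_rng(1).standard_normal((512, 2)) * 5.0
samples = langevin_sample(grad_energy, x0)
```

On this unimodal toy energy the chain mixes easily; the paper's point is that on a genuinely multimodal joint data space, chains started from noise tend to get stuck and miss coherent inter-modal structure.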

Key facts

  • Energy-based models (EBMs) are a flexible class of deep generative models.
  • Learning a multimodal EBM by maximum likelihood requires MCMC sampling in joint data space.
  • Noise-initialized Langevin dynamics often mixes poorly and fails to discover coherent inter-modal relationships.
  • Multimodal VAEs capture inter-modal dependencies via a shared latent generator and joint inference model.
  • Both the shared latent generator and the joint inference model are parameterized as unimodal Gaussian or Laplace distributions.
  • This unimodal parameterization limits how well they can approximate complex multimodal data structures.
  • A learning framework is presented that interweaves MCMC revision into training.
  • The framework jointly studies the learning of the multimodal EBM, the shared latent generator, and the joint inference model.
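The unimodal parameterization noted above can be made concrete: a joint inference model maps all modalities to a single diagonal Gaussian over the shared latent, sampled with the reparameterization trick. The sketch below is a hypothetical minimal version (the fusion weights and dimensions are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d_img, d_txt, d_z = 8, 4, 2  # illustrative modality / latent dimensions

# Hypothetical linear fusion weights; in practice these are learned networks.
W_img = rng.standard_normal((d_img, d_z))
W_txt = rng.standard_normal((d_txt, d_z))
W_logvar = rng.standard_normal((d_img + d_txt, d_z)) * 0.1

def joint_inference(x_img, x_txt):
    """q(z | x_img, x_txt): one diagonal Gaussian over the shared latent.
    However complex the data, the posterior approximation stays unimodal."""
    mu = x_img @ W_img + x_txt @ W_txt
    logvar = np.concatenate([x_img, x_txt], axis=-1) @ W_logvar
    return mu, logvar

def reparameterize(mu, logvar):
    """z = mu + sigma * eps keeps sampling differentiable in (mu, logvar)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

x_img = rng.standard_normal((3, d_img))
x_txt = rng.standard_normal((3, d_txt))
mu, logvar = joint_inference(x_img, x_txt)
z = reparameterize(mu, logvar)
```

Because q is a single Gaussian regardless of the inputs, it cannot represent multimodal posteriors; this is the limitation the MCMC-revision step is meant to address.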

Entities

Institutions

  • arXiv
