Human Alignment Peaks at Intermediate Generative-Discriminative Training
A new study on arXiv (2605.23819) asks whether discriminative or generative learning produces visual representations that align more closely with human perception. It addresses the confounds of architecture, scale, and training data that complicated earlier comparisons. The authors use Joint Energy-Based Models (JEMs), which combine both objectives in a single framework, so that varying one mixing coefficient isolates the effect of the learning objective. They evaluate the resulting models on six human-alignment benchmarks, including perceptual similarity and gloss perception. Human alignment peaked at intermediate points on the generative-discriminative continuum, which suggests that combining the two objectives yields more human-like representations than either extreme alone.
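The idea of one coefficient sliding between the two objectives can be made concrete with a minimal numpy sketch. It is not the study's code. It assumes a linear classifier f(x) = xW + b in place of a deep network, the standard JEM energy E(x) = -logsumexp_y f(x)[y], stochastic gradient Langevin dynamics (SGLD) to draw model samples, and a convex combination (1 - lam) * discriminative + lam * generative as the mixing rule. The study's exact parameterization of its coefficient is not stated here, and the names `mixed_jem_loss`, `sgld_sample`, and `lam` are hypothetical.

```python
import numpy as np

def logsumexp(z, axis=-1):
    # Numerically stable log-sum-exp along `axis`.
    m = z.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(z - m).sum(axis=axis, keepdims=True))).squeeze(axis)

def softmax(z):
    # Row-wise softmax of a 2-D array of logits.
    return np.exp(z - logsumexp(z)[:, None])

def energy(x, W, b):
    # JEM energy of each input row: E(x) = -logsumexp_y f(x)[y], f(x) = x @ W + b.
    return -logsumexp(x @ W + b)

def sgld_sample(x0, W, b, steps=20, step_size=1.0, noise=0.01, rng=None):
    # Hypothetical helper: SGLD on E(x). For a linear f the input gradient is
    # analytic: dE/dx = -softmax(f(x)) @ W.T, one row per sample.
    rng = np.random.default_rng() if rng is None else rng
    x = x0.copy()
    for _ in range(steps):
        grad_e = -softmax(x @ W + b) @ W.T
        x = x - step_size * grad_e + noise * rng.standard_normal(x.shape)
    return x

def mixed_jem_loss(x, y, W, b, lam, rng=None):
    # ASSUMPTION: lam = 0 is purely discriminative, lam = 1 purely generative.
    logits = x @ W + b
    # Discriminative term: cross-entropy of p(y|x) = softmax(f(x)).
    ce = np.mean(logsumexp(logits) - logits[np.arange(len(y)), y])
    # Generative term: contrastive-divergence surrogate E(data) - E(samples).
    # Holding the samples fixed, its parameter gradient matches that of
    # -log p(x); its value is not the negative log-likelihood itself.
    rng = np.random.default_rng(0) if rng is None else rng
    x_neg = sgld_sample(rng.uniform(-1.0, 1.0, x.shape), W, b, rng=rng)
    gen = np.mean(energy(x, W, b)) - np.mean(energy(x_neg, W, b))
    return (1.0 - lam) * ce + lam * gen, ce, gen

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((8, 5))
    y = rng.integers(0, 3, size=8)
    W = 0.1 * rng.standard_normal((5, 3))
    b = np.zeros(3)
    for lam in (0.0, 0.5, 1.0):
        total, ce, gen = mixed_jem_loss(x, y, W, b, lam, rng=np.random.default_rng(1))
        print(f"lam={lam}: total={total:.4f} (ce={ce:.4f}, gen={gen:.4f})")
```

Because the classifier and the energy share the same logits, every value of `lam` trains one set of parameters; only the weighting of the two terms changes, which is what lets a sweep over a single coefficient separate the effect of the objective from architecture and data.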
Key facts
- Study uses Joint Energy-Based Models (JEMs) to interpolate between discriminative and generative training.
- Varies a single mixing coefficient to isolate the effect of the learning objective.
- Evaluates models across six human-alignment benchmarks.
- Human alignment is consistently maximized at intermediate points of the continuum.
- Addresses confounds of architecture, scale, and training data in prior comparisons.
- Benchmarks include perceptual similarity, gloss perception, and human response uncertainty.
- Also tests robustness, shape-texture cue conflict, and diagnostic feature attribution.
- Published on arXiv with ID 2605.23819.
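Perceptual-similarity benchmarks are commonly scored with a two-alternative forced choice (2AFC) protocol: for each triplet, the model picks whichever of two distorted images lies closer to the reference in its representation space, and it earns credit equal to the fraction of human raters who made the same choice. The sketch below shows that scoring rule only. Whether the study uses this exact protocol is an assumption, and `two_afc_score` and `embed` are hypothetical names.

```python
import numpy as np

def two_afc_score(ref, x0, x1, human_pref_x1, embed):
    # Distances from the reference in the model's representation space.
    d0 = np.linalg.norm(embed(ref) - embed(x0), axis=1)
    d1 = np.linalg.norm(embed(ref) - embed(x1), axis=1)
    # Model "chooses" x1 when it is closer to the reference; ties count as 0.5.
    model_x1 = np.where(d1 < d0, 1.0, np.where(d1 > d0, 0.0, 0.5))
    # Credit equals the fraction of humans who made the same choice.
    h = np.asarray(human_pref_x1, dtype=float)
    return float(np.mean(model_x1 * h + (1.0 - model_x1) * (1.0 - h)))

if __name__ == "__main__":
    ref = np.zeros((2, 2))
    x0 = np.array([[1.0, 0.0], [0.1, 0.0]])
    x1 = np.array([[0.1, 0.0], [1.0, 0.0]])
    # Identity embedding: compare raw vectors directly.
    print(two_afc_score(ref, x0, x1, [0.8, 0.3], embed=lambda v: v))
```

In this toy input the model agrees with 80% of raters on the first triplet and 70% on the second, so the score is their mean; a model matching the human majority on every triplet still cannot exceed the human agreement ceiling.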
Entities
Institutions
- arXiv