ARTFEED — Contemporary Art Intelligence

Inference-Time Alignment with SLOP for Reward Hacking Mitigation

ai-technology · 2026-05-14

A new arXiv paper introduces reference-model temperature adjustment to generalize inference-time alignment techniques, enabling ensembles of generative reward models to be combined via a sharpened logarithmic opinion pool (SLOP). The authors propose an algorithm for calibrating the SLOP weight parameters and demonstrate improved robustness against reward hacking while preserving alignment performance. The work extends existing theoretical analyses that justify inference-time alignment as an approximation to sampling from optimally tilted distributions.
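
For orientation, the two ideas can be sketched in standard notation. The symbols below (τ for the reference temperature, w_k for the pool weights, p_k for the reward models' distributions) are illustrative choices, not taken from the paper:

    % Exponential tilting of a reference policy by a scalar reward (standard form)
    \pi^{*}(y \mid x) \;\propto\; \pi_{\mathrm{ref}}(y \mid x)\, \exp\big(\beta\, r(x, y)\big)

    % Sharpened logarithmic opinion pool over K generative reward models,
    % with a temperature adjustment 1/\tau applied to the reference model
    \pi_{\mathrm{SLOP}}(y \mid x) \;\propto\; \pi_{\mathrm{ref}}(y \mid x)^{1/\tau} \prod_{k=1}^{K} p_{k}(y \mid x)^{w_{k}}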

Key facts

  • Inference-time alignment techniques are lightweight alternatives to reinforcement learning.
  • They enable continual adaptation as alignment objectives and reward targets evolve.
  • Existing theoretical analyses justify these methods as approximations to sampling from optimally tilted distributions.
  • The paper introduces reference-model temperature adjustment.
  • This generalizes inference-time alignment to ensembles of generative reward models.
  • The ensemble is combined via a sharpened logarithmic opinion pool (SLOP); see the sketch after this list.
  • An algorithm for calibrating SLOP weight parameters is proposed.
  • Experiments show improved robustness against reward hacking while preserving alignment performance.
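
As a concrete illustration of how a SLOP could rescore sampled candidates, here is a minimal numpy sketch. It is not the paper's algorithm: the function name slop_scores, the ref_temperature knob, and the fixed example weights are all assumptions; the paper instead calibrates the weights with its proposed algorithm.

    # Illustrative SLOP-style candidate rescoring; a sketch, not the paper's method.
    import numpy as np

    def slop_scores(log_p_ref, log_p_rms, weights, ref_temperature=1.0):
        """Combine log-probabilities as a sharpened logarithmic opinion pool.

        log_p_ref : (n,) reference-model log-probs for n candidate responses.
        log_p_rms : (k, n) log-probs from k generative reward models.
        weights   : (k,) pool weights (the parameters the paper calibrates).
        ref_temperature : divides the reference log-probs; stands in for the
            paper's reference-model temperature adjustment (hypothetical name).
        """
        # Weighted sum of log-probs == product of tempered distributions.
        pooled = log_p_ref / ref_temperature + weights @ log_p_rms
        # Stable softmax so the scores form a distribution over candidates.
        pooled -= pooled.max()
        probs = np.exp(pooled)
        return probs / probs.sum()

    # Toy usage: 4 candidate responses scored by 2 generative reward models.
    rng = np.random.default_rng(0)
    log_p_ref = rng.normal(-5.0, 1.0, size=4)
    log_p_rms = rng.normal(-5.0, 1.0, size=(2, 4))
    print(slop_scores(log_p_ref, log_p_rms, weights=np.array([0.7, 0.3]),
                      ref_temperature=0.8))

The pooling is what the robustness claim rests on: a candidate that games one reward model must still score well under the others and remain plausible under the tempered reference model, so no single model's exploits dominate the pooled score.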

Entities

Institutions

  • arXiv
