ARTFEED — Contemporary Art Intelligence

Test-Time Personalization Scaling Failures in LLMs

ai-technology · 2026-05-13

A recent arXiv paper (2605.10991) presents a framework for Test-Time Personalization (TTP) in large language models: inference-time compute is scaled by sampling N candidate responses from a tailored policy model and selecting among them with a personalized reward model. The authors show that oracle selection yields expected utility that grows logarithmically with the number of candidates, which sets the theoretical ceiling, and that standard reward models fall well short of it. They introduce a unified scaling law that decomposes any reward model's Best-of-N curve into four measurable factors, and they identify two failure modes: user-level collapse, where the model produces near-constant predictions for certain users, and query-level reward hacking, where predictions correlate negatively with actual quality on specific queries. The paper proposes a probabilistic fix for both.
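The Best-of-N loop at the heart of TTP is simple to state; below is a minimal Python sketch, with `generate_candidate` and `reward` as hypothetical stand-ins for the paper's personalized policy and reward models (neither name comes from the paper). The small simulation afterward illustrates why oracle selection grows logarithmically: the expected maximum of N i.i.d. Exponential(1) utilities is the harmonic number H_N ≈ ln N.

```python
import random


def best_of_n(prompt, user_id, n, generate_candidate, reward):
    """Sample n candidates from the (personalized) policy model and
    return the one the personalized reward model scores highest."""
    candidates = [generate_candidate(prompt, user_id) for _ in range(n)]
    scores = [reward(prompt, user_id, c) for c in candidates]
    return candidates[scores.index(max(scores))]


def oracle_utility(n, trials=10_000):
    """Monte Carlo estimate of E[max of n i.i.d. Exponential(1) utilities].
    Under oracle selection (reward == true utility) this equals the
    harmonic number H_n ~ ln(n): the logarithmic ceiling the paper cites."""
    return sum(
        max(random.expovariate(1.0) for _ in range(n)) for _ in range(trials)
    ) / trials


if __name__ == "__main__":
    for n in (1, 2, 4, 8, 16, 32):
        print(f"N={n:>2}  oracle utility ≈ {oracle_utility(n):.3f}")
```

Swapping a learned reward model into `best_of_n` in place of the oracle is exactly where, per the paper's scaling law, the curve can flatten or bend downward.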

Key facts

  • arXiv paper 2605.10991
  • Focuses on Test-Time Personalization (TTP)
  • Scales inference-time computation by sampling N candidates
  • Oracle selection yields logarithmic utility growth
  • Standard reward models fail to scale
  • Identifies user-level collapse and query-level reward hacking (see the diagnostic sketch after this list)
  • Proposes a probabilistic fix
  • Announce type: cross
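
A hedged sketch of how those two failure modes could be checked empirically, assuming the reward model's scores are grouped per user and (score, true-quality) pairs per query; the function names and threshold are illustrative, not from the paper (`statistics.correlation` requires Python 3.10+):

```python
import statistics


def users_with_collapse(rewards_by_user, eps=1e-6):
    """User-level collapse: near-constant reward predictions for a user,
    so Best-of-N degenerates into a uniform-random pick for them."""
    return [
        user for user, scores in rewards_by_user.items()
        if len(scores) > 1 and statistics.pstdev(scores) < eps
    ]


def queries_with_reward_hacking(pairs_by_query):
    """Query-level reward hacking: predicted reward correlates negatively
    with actual quality, so a larger N actively selects worse answers."""
    hacked = []
    for query, pairs in pairs_by_query.items():
        preds, quality = zip(*pairs)
        if len(set(preds)) > 1 and len(set(quality)) > 1:
            r = statistics.correlation(preds, quality)  # Pearson r
            if r < 0:
                hacked.append((query, r))
    return hacked
```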

Entities

Institutions

  • arXiv
