Test-Time Personalization Scaling Failures in LLMs
A recent arXiv paper (2605.10991) presents a framework for Test-Time Personalization (TTP) in large language models: inference-time computation is scaled by sampling N candidate responses from a tailored policy model and selecting the best one with a personalized reward model. The authors show that oracle selection yields expected utility that grows logarithmically with the number of candidates, establishing a theoretical upper bound. Standard reward models, however, fall short of this bound. The paper introduces a unified scaling law that decomposes any reward model's Best-of-N curve into four measurable factors and identifies two failure modes: user-level collapse, where the model emits effectively constant predictions for certain users, and query-level reward hacking, where predicted rewards correlate negatively with actual quality on specific queries. The authors propose a probabilistic fix for both.
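The Best-of-N loop the paper builds on is simple to state. Below is a minimal Python sketch, assuming hypothetical `sample_candidate` and `score` callables (the paper names neither); the appended Monte-Carlo check illustrates why oracle selection can grow logarithmically, under an i.i.d. exponential-utility assumption that is ours, not the authors'.

```python
import math
import random
from typing import Callable, List

def best_of_n(
    prompt: str,
    user_id: str,
    n: int,
    sample_candidate: Callable[[str], str],   # hypothetical policy sampler
    score: Callable[[str, str, str], float],  # hypothetical reward model
) -> str:
    """Sample n candidates from the policy model and return the one the
    personalized reward model scores highest; plugging the true utility
    in as `score` gives the oracle upper bound instead."""
    candidates: List[str] = [sample_candidate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: score(user_id, prompt, c))

# Toy check of the oracle bound under an assumption of ours (not the paper's):
# with i.i.d. Exp(1) candidate utilities, E[max of n] is the harmonic number
# H_n ~ ln(n) + 0.5772, i.e. logarithmic growth in n.
def oracle_utility(n: int, trials: int = 20_000) -> float:
    return sum(
        max(random.expovariate(1.0) for _ in range(n)) for _ in range(trials)
    ) / trials

if __name__ == "__main__":
    for n in (1, 2, 4, 8, 16, 32):
        print(f"n={n:2d}  E[max] ~ {oracle_utility(n):.2f}"
              f"  ln(n)+gamma ~ {math.log(n) + 0.5772:.2f}")
```

The exponential toy model is only one distribution with the right tail behavior; the paper's bound is stated more generally, so treat the simulation as intuition rather than a reproduction of their result.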
Key facts
- arXiv paper 2605.10991
- Focuses on Test-Time Personalization (TTP)
- Scales inference-time computation by sampling N candidates
- Oracle selection yields logarithmic utility growth
- Standard reward models fail to scale
- Identifies user-level collapse and query-level reward hacking (see the diagnostic sketch after this list)
- Proposes a probabilistic fix
- Announce type: cross
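Both failure modes in the list above are detectable from logged scores. The sketch below is an illustrative screen, assuming records of (user, query, reward-model score, true utility) are already collected; the constancy tolerance and the use of Pearson correlation via `statistics.correlation` (Python 3.10+) are our choices, not the paper's diagnostic.

```python
from collections import defaultdict
from statistics import StatisticsError, correlation, pstdev

# records: iterable of (user_id, query_id, reward_score, true_utility) tuples
def diagnose(records, collapse_tol: float = 1e-6):
    by_user = defaultdict(list)
    by_query = defaultdict(lambda: ([], []))
    for user, query, pred, util in records:
        by_user[user].append(pred)
        by_query[query][0].append(pred)
        by_query[query][1].append(util)

    # User-level collapse: near-constant scores for a user mean Best-of-N
    # degenerates into random selection for that user.
    collapsed = [u for u, preds in by_user.items()
                 if len(preds) > 1 and pstdev(preds) < collapse_tol]

    # Query-level reward hacking: scores anti-correlated with true quality,
    # so ranking by reward picks worse candidates as N grows.
    hacked = []
    for q, (preds, utils) in by_query.items():
        if len(preds) > 2:
            try:
                if correlation(preds, utils) < 0:
                    hacked.append(q)
            except StatisticsError:  # constant inputs: correlation undefined
                pass
    return collapsed, hacked
```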
Entities
Institutions
- arXiv