Test-Time Personalization Scaling Failures in LLMs
A recent arXiv paper (2605.10991) presents a framework for Test-Time Personalization (TTP) in large language models: inference-time computation is scaled by sampling N candidate responses from a tailored policy model and selecting the best one with a personalized reward model. The authors show that oracle selection yields expected utility that grows logarithmically with the number of candidates, establishing a theoretical upper bound. Standard reward models, however, fall short of this bound. The paper introduces a unified scaling law that decomposes any reward model's Best-of-N curve into four measurable factors and identifies two failure modes: user-level collapse, where the model emits effectively constant predictions for certain users, and query-level reward hacking, where predicted rewards correlate negatively with actual quality on specific queries. The authors propose a probabilistic fix for both.
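The Best-of-N loop the paper builds on is simple to state. Below is a minimal Python sketch, assuming hypothetical `sample_candidate` and `score` callables (the paper names neither); the appended Monte-Carlo check illustrates why oracle selection can grow logarithmically, under an i.i.d. exponential-utility assumption that is ours, not the authors'.

```python
import math
import random
from typing import Callable, List

def best_of_n(
    prompt: str,
    user_id: str,
    n: int,
    sample_candidate: Callable[[str], str],   # hypothetical policy sampler
    score: Callable[[str, str, str], float],  # hypothetical reward model
) -> str:
    """Sample n candidates from the policy model and return the one the
    personalized reward model scores highest; plugging the true utility
    in as `score` gives the oracle upper bound instead."""
    candidates: List[str] = [sample_candidate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: score(user_id, prompt, c))

# Toy check of the oracle bound under an assumption of ours (not the paper's):
# with i.i.d. Exp(1) candidate utilities, E[max of n] is the harmonic number
# H_n ~ ln(n) + 0.5772, i.e. logarithmic growth in n.
def oracle_utility(n: int, trials: int = 20_000) -> float:
    return sum(
        max(random.expovariate(1.0) for _ in range(n)) for _ in range(trials)
    ) / trials

if __name__ == "__main__":
    for n in (1, 2, 4, 8, 16, 32):
        print(f"n={n:2d}  E[max] ~ {oracle_utility(n):.2f}"
              f"  ln(n)+gamma ~ {math.log(n) + 0.5772:.2f}")
```

The exponential toy model is only one distribution with the right tail behavior; the paper's bound is stated more generally, so treat the simulation as intuition rather than a reproduction of their result.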
Key facts
- arXiv paper 2605.10991
- Focuses on Test-Time Personalization (TTP)
- Scales inference-time computation by sampling N candidates
- Oracle selection yields logarithmic utility growth
- Standard reward models fail to scale
- Identifies user-level collapse and query-level reward hacking (see the diagnostic sketch after this list)
- Proposes a probabilistic fix
- Announce type: cross
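Both failure modes in the list above are detectable from logged scores. The sketch below is an illustrative screen, assuming records of (user, query, reward-model score, true utility) are already collected; the constancy tolerance and the use of Pearson correlation via `statistics.correlation` (Python 3.10+) are our choices, not the paper's diagnostic.

```python
from collections import defaultdict
from statistics import StatisticsError, correlation, pstdev

# records: iterable of (user_id, query_id, reward_score, true_utility) tuples
def diagnose(records, collapse_tol: float = 1e-6):
    by_user = defaultdict(list)
    by_query = defaultdict(lambda: ([], []))
    for user, query, pred, util in records:
        by_user[user].append(pred)
        by_query[query][0].append(pred)
        by_query[query][1].append(util)

    # User-level collapse: near-constant scores for a user mean Best-of-N
    # degenerates into random selection for that user.
    collapsed = [u for u, preds in by_user.items()
                 if len(preds) > 1 and pstdev(preds) < collapse_tol]

    # Query-level reward hacking: scores anti-correlated with true quality,
    # so ranking by reward picks worse candidates as N grows.
    hacked = []
    for q, (preds, utils) in by_query.items():
        if len(preds) > 2:
            try:
                if correlation(preds, utils) < 0:
                    hacked.append(q)
            except StatisticsError:  # constant inputs: correlation undefined
                pass
    return collapsed, hacked
```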
Entities
Institutions
- arXiv