Persona-Traceable Shared RL Policy for Scalable Game NPCs

ai-technology · 2026-05-25

pcsp (Persona-Conditioned Shared Policy) is a reinforcement learning technique for scalable, consistent NPC behavior in life-simulation games. On a 300-persona benchmark, it reports compositional zero-shot persona identification up to 17 times above chance, a Spearman rho of approximately 0.73 for semantic-behavioral alignment, and inference 22 times faster than an LLM-as-policy baseline. The approach uses a single shared policy conditioned on frozen LLM embeddings of free-form persona descriptions. It combines once-per-NPC encoding, low-rank projection, and neural conditioning with a training objective that joins PPO, InfoNCE, and a KL diversity term. The method targets three limitations of existing approaches: persona consistency, controllability, and real-time inference.
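
The data flow of a persona-conditioned shared policy can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the method from the paper: the hash-based `stand_in_embedding` replaces a real frozen LLM encoder, the conditioning is written as FiLM-style scale and shift (the summary only says "neural conditioning"), and all class names, dimensions, and weights are hypothetical and untrained.

```python
import hashlib
import math
import random


def stand_in_embedding(text, dim):
    """Hypothetical stand-in for a frozen LLM encoder: a deterministic pseudo-random vector."""
    seed = int.from_bytes(hashlib.sha256(text.encode("utf-8")).digest()[:8], "big")
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def make_matrix(rows, cols, rng, scale=0.1):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class PersonaConditionedPolicy:
    """One set of weights shared by every NPC; only the cached persona code differs."""

    def __init__(self, embed_dim, rank, obs_dim, hidden_dim, n_actions, seed=0):
        rng = random.Random(seed)
        self.proj = make_matrix(rank, embed_dim, rng)       # low-rank projection
        self.w_obs = make_matrix(hidden_dim, obs_dim, rng)  # observation encoder
        self.w_gamma = make_matrix(hidden_dim, rank, rng)   # conditioning: scale (assumed FiLM)
        self.w_beta = make_matrix(hidden_dim, rank, rng)    # conditioning: shift (assumed FiLM)
        self.w_out = make_matrix(n_actions, hidden_dim, rng)
        self.cache = {}

    def register_npc(self, npc_id, persona_embedding):
        # Once-per-NPC step: project the frozen embedding and cache the result.
        self.cache[npc_id] = matvec(self.proj, persona_embedding)

    def action_probs(self, npc_id, obs):
        # Per-step inference touches only the cached code, never the LLM.
        z = self.cache[npc_id]
        hidden = matvec(self.w_obs, obs)
        gamma = matvec(self.w_gamma, z)
        beta = matvec(self.w_beta, z)
        hidden = [math.tanh((1.0 + g) * h + b) for g, h, b in zip(gamma, hidden, beta)]
        logits = matvec(self.w_out, hidden)
        top = max(logits)
        exps = [math.exp(l - top) for l in logits]
        total = sum(exps)
        return [e / total for e in exps]


policy = PersonaConditionedPolicy(embed_dim=16, rank=4, obs_dim=3, hidden_dim=8, n_actions=5)
policy.register_npc("baker", stand_in_embedding("a shy baker who avoids crowds", 16))
policy.register_npc("merchant", stand_in_embedding("a bold merchant who loves haggling", 16))

obs = [0.2, -0.1, 0.5]
probs_baker = policy.action_probs("baker", obs)
probs_merchant = policy.action_probs("merchant", obs)
```

Because the persona text is encoded once at registration, the per-step cost is a handful of small matrix-vector products. That separation is a plausible reading of where a speedup over an LLM-as-policy baseline would come from, though the summary does not break the figure down.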

Key facts

pcsp achieves up to 17x above chance on compositional zero-shot persona identification
Spearman rho ≈ 0.73 for semantic-behavioral alignment
22x faster inference than LLM-as-policy baseline
Single RL policy conditioned on frozen LLM embeddings
Uses PPO + InfoNCE + KL diversity training objective
Tested on 300-persona life-simulation benchmark
Addresses persona consistency, controllability, real-time inference
Combines once-per-NPC encoding, low-rank projection, neural conditioning
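
Of the three terms in the training objective, InfoNCE is the one that ties behavior back to a persona. The sketch below shows a standard InfoNCE loss over a batch of paired vectors. It assumes the contrast is between a behavior summary vector and the matching persona vector, which the summary does not specify; the function names, the cosine similarity, and the temperature value are illustrative choices, and the PPO and KL diversity terms are omitted.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(behavior_vecs, persona_vecs, temperature=0.1):
    """Mean InfoNCE loss; row i of each list is assumed to be a matching pair."""
    n = len(behavior_vecs)
    total = 0.0
    for i in range(n):
        # Similarity of behavior i to every persona in the batch.
        logits = [cosine(behavior_vecs[i], persona_vecs[j]) / temperature for j in range(n)]
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        # Negative log-probability of picking the matching persona.
        total += log_denominator - logits[i]
    return total / n


personas = [[1.0, 0.0], [0.0, 1.0]]
aligned_behavior = [[1.0, 0.0], [0.0, 1.0]]   # each behavior matches its own persona
swapped_behavior = [[0.0, 1.0], [1.0, 0.0]]   # each behavior matches the other persona

loss_aligned = info_nce(aligned_behavior, personas)
loss_swapped = info_nce(swapped_behavior, personas)
```

The loss is small when each behavior vector is closest to its own persona and large when the pairing is swapped, so minimizing it pushes behavior toward being identifiable by persona.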

Entities

—

Sources

arXiv cs.AI — 2026-05-25