Persona-Traceable Shared RL Policy for Scalable Game NPCs
pcsp (Persona Conditioned Shared Policy) is a reinforcement learning technique for scalable, consistent NPC behavior in life-simulation games. A single shared policy is conditioned on frozen LLM embeddings of free-form persona descriptions. The method combines once-per-NPC encoding, low-rank projection, and neural conditioning, and it is trained with an objective that combines PPO, InfoNCE, and a KL diversity term. On a 300-persona benchmark, pcsp reaches compositional zero-shot persona identification up to 17 times above chance, a Spearman rho of approximately 0.73 for semantic-behavioral alignment, and inference 22 times faster than an LLM-as-policy baseline. It addresses three limitations of existing methods: persona consistency, controllability, and real-time inference.
Key facts
- pcsp achieves up to 17x above-chance compositional zero-shot persona identification
- Spearman rho ≈ 0.73 semantic-behavioral alignment
- 22x faster inference than LLM-as-policy baseline
- Single RL policy conditioned on frozen LLM embeddings
- Uses PPO + InfoNCE + KL diversity training objective
- Tested on 300-persona life-simulation benchmark
- Addresses persona consistency, controllability, real-time inference
- Combines once-per-NPC encoding, low-rank projection, neural conditioning
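To make the contrastive part of the training objective concrete, here is a minimal InfoNCE sketch that pairs each persona code with a summary vector of the behavior it produced. The pairing scheme, the temperature, the loss weights, and the sign of the KL diversity term in `combined_loss` are all assumptions made for illustration. The summary only names the three components (PPO, InfoNCE, KL diversity) and does not state how they are weighted or combined.

```python
import numpy as np

def info_nce(persona_codes, behavior_codes, temperature=0.1):
    """InfoNCE over a batch; row i of each matrix forms the positive pair."""
    a = persona_codes / np.linalg.norm(persona_codes, axis=1, keepdims=True)
    b = behavior_codes / np.linalg.norm(behavior_codes, axis=1, keepdims=True)
    logits = (a @ b.T) / temperature                      # cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Cross-entropy with the matching index (the diagonal) as the target.
    return float(-np.mean(np.diag(log_softmax)))

def combined_loss(ppo_loss, nce_loss, kl_diversity, w_nce=0.5, w_kl=0.01):
    # Assumed form: minimise the PPO and InfoNCE losses while rewarding
    # divergence between the action distributions of different personas.
    return ppo_loss + w_nce * nce_loss - w_kl * kl_diversity

rng = np.random.default_rng(0)
codes = rng.standard_normal((16, 8))
aligned = codes + 0.05 * rng.standard_normal((16, 8))  # behavior tracks persona
shuffled = np.roll(aligned, 1, axis=0)                 # pairing deliberately broken

loss_aligned = info_nce(codes, aligned)
loss_shuffled = info_nce(codes, shuffled)
```

The loss is low when each persona's behavior vector is closest to its own persona code and high when the pairing is broken. That is the property that would let a persona be identified from behavior alone.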