TRACE: A Turn-Aware Credit Assignment Framework for Multi-Turn Jailbreaking
A new arXiv paper (2605.08778) introduces TRACE, a turn-aware credit assignment framework for reinforcement learning-based multi-turn jailbreaking attacks on LLMs. The authors observe that, in multi-turn dialogues, the contribution of each turn to a successful jailbreak is non-uniform, phase-dependent, and target-specific. A coarse trajectory-level outcome signal therefore creates a credit assignment problem: redundant turns are over-rewarded and useful intermediate turns are under-credited. TRACE addresses this by estimating turn-level contributions through leave-one-turn-out semantic masking on successful trajectories, and by applying a separate credit assignment to failed ones. The stated aim is to make multi-turn jailbreak attacks more effective by giving the attack policy more granular feedback than a single trajectory-level outcome.
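The leave-one-turn-out idea can be illustrated with a minimal, abstract sketch. This is not the paper's implementation: the `MASK` placeholder, the `leave_one_turn_out` helper, and the `toy_score` function are all hypothetical names introduced here, and the toy scorer simply counts distinct unmasked turns in place of a real outcome judge. The sketch only shows the general ablation pattern, in which a turn's contribution is the drop in outcome score when that turn alone is masked.

```python
from typing import Callable, List, Sequence

# Hypothetical neutral placeholder standing in for "semantic masking";
# the paper's actual masking procedure is not described in this summary.
MASK = "[MASKED]"


def leave_one_turn_out(
    turns: Sequence[str],
    score: Callable[[Sequence[str]], float],
) -> List[float]:
    """Estimate each turn's contribution as the score drop when it is masked."""
    full_score = score(turns)
    contributions = []
    for i in range(len(turns)):
        masked = list(turns)
        masked[i] = MASK  # mask exactly one turn, keep all others intact
        contributions.append(full_score - score(masked))
    return contributions


def toy_score(turns: Sequence[str]) -> float:
    """Toy stand-in for an outcome judge (assumption, for illustration only).

    Counts distinct unmasked turns, scaled by 3, so a duplicated turn
    adds nothing to the score.
    """
    distinct = {t for t in turns if t != MASK}
    return len(distinct) / 3.0


if __name__ == "__main__":
    # Turns 1 and 2 are duplicates of each other, so masking either one
    # leaves the toy score unchanged and its contribution is zero.
    dialogue = ["a", "b", "b", "c"]
    print(leave_one_turn_out(dialogue, toy_score))
```

In this toy setup the duplicated turns receive zero contribution while the unique turns receive a positive one, which is the kind of non-uniform signal a single trajectory-level outcome cannot express.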
Key facts
- arXiv paper 2605.08778
- TRACE framework for multi-turn jailbreaking
- Turn-level contributions are non-uniform, phase-dependent, and target-specific
- Coarse trajectory-level outcome signals cause a credit assignment problem
- Leave-one-turn-out semantic masking for successful trajectories
- Addresses over-rewarding redundant turns and under-crediting useful intermediate turns
- Uses reinforcement learning for attack strategies
- Focuses on LLM multi-turn dialogues
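To make the credit assignment problem in the list above concrete, the following sketch contrasts a uniform split of a trajectory-level outcome with a contribution-weighted split. The function names, the clipping of negative contributions, and the proportional normalisation are assumptions chosen for illustration; the paper's actual reward formula is not given in this summary.

```python
from typing import List, Sequence


def uniform_credit(n_turns: int, outcome: float) -> List[float]:
    """Coarse baseline: every turn gets the same share of the outcome."""
    return [outcome / n_turns] * n_turns


def turn_aware_credit(
    contributions: Sequence[float], outcome: float
) -> List[float]:
    """Split the outcome in proportion to per-turn contribution estimates.

    Negative contributions are clipped to zero (an assumption made here).
    If no turn has a positive contribution, fall back to the uniform split.
    """
    clipped = [max(c, 0.0) for c in contributions]
    total = sum(clipped)
    if total == 0.0:
        return uniform_credit(len(contributions), outcome)
    return [outcome * c / total for c in clipped]


if __name__ == "__main__":
    # Hypothetical contribution estimates: the two middle turns are redundant.
    estimates = [0.5, 0.0, 0.0, 0.5]
    print(uniform_credit(len(estimates), 1.0))  # redundant turns still rewarded
    print(turn_aware_credit(estimates, 1.0))    # redundant turns get nothing
```

Under the uniform split, the redundant turns receive as much reward as the useful ones; under the weighted split they receive none, which corresponds to the over-rewarding and under-crediting described in the key facts.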
Entities
Institutions
- arXiv