Test-Time Training Creates New AI Jailbreak Vulnerabilities
Researchers report that Test-Time Training (TTT), a technique that lets AI models adapt their parameters during inference, introduces exploitable security flaws. In a study posted on arXiv, they outline three threat models under which attackers can bypass safety filters. Under LoRA, the few-shot and generation-phase threat models achieved average Attack Success Rates (ASR@10) of 95% and 93%, respectively, across various model families and scales. The vulnerabilities also transfer to production fine-tuning APIs. The paper further warns that TTT-induced overfitting can produce degenerate outputs that inflate ASR under standard judges.
Key facts
- Test-Time Training (TTT) enables models to adapt parameters during inference.
- Three threat models for TTT are identified.
- Attackers can exploit TTT to bypass safety filters.
- Under LoRA, the few-shot threat model achieves an average ASR@10 of 95%.
- Under LoRA, the generation-phase threat model achieves an average ASR@10 of 93%.
- Vulnerabilities transfer to production fine-tuning APIs.
- TTT-induced overfitting can produce degenerate outputs that inflate ASR.
- Study published on arXiv with ID 2605.22984.
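To make the ASR@10 figures and the degenerate-output caveat concrete, here is a minimal sketch of how such a metric might be computed. It assumes the common convention that a prompt counts as a success if at least one of its first k attempts is flagged by the judge; the paper's exact definition may differ. The function names, the whitespace tokenization, and the 0.3 unique-token threshold are hypothetical choices for illustration, not the authors' method.

```python
from typing import List, Sequence, Tuple


def asr_at_k(judgments: Sequence[Sequence[bool]], k: int = 10) -> float:
    """Fraction of prompts with at least one judged success in the first k attempts.

    Assumed convention for ASR@k; not taken from the paper.
    """
    if not judgments:
        raise ValueError("need at least one prompt")
    hits = sum(1 for attempts in judgments if any(attempts[:k]))
    return hits / len(judgments)


def is_degenerate(text: str, min_unique_ratio: float = 0.3) -> bool:
    """Hypothetical heuristic: flag empty or highly repetitive outputs.

    An output is treated as degenerate when the share of distinct
    whitespace-separated tokens falls below min_unique_ratio.
    """
    tokens = text.split()
    if not tokens:
        return True
    return len(set(tokens)) / len(tokens) < min_unique_ratio


def filtered_asr_at_k(
    attempts: Sequence[Sequence[Tuple[str, bool]]], k: int = 10
) -> float:
    """ASR@k where a judged success only counts if the output is not degenerate.

    Each prompt maps to a list of (output_text, judge_flag) pairs.
    """
    cleaned: List[List[bool]] = [
        [flag and not is_degenerate(text) for text, flag in prompt_attempts]
        for prompt_attempts in attempts
    ]
    return asr_at_k(cleaned, k)


# Toy data: both outputs are flagged by the judge, but the first is repetitive.
toy = [
    [("ok ok ok ok ok", True)],
    [("a varied and coherent reply with distinct words", True)],
]
raw = asr_at_k([[flag for _, flag in p] for p in toy], k=10)
filtered = filtered_asr_at_k(toy, k=10)
```

On this toy data the raw score counts both prompts while the filtered score counts only the second, which illustrates how repetitive outputs could inflate ASR when the judge alone decides.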
Entities
Institutions
- arXiv