Test-Time Training Creates New AI Jailbreak Vulnerabilities

ai-technology · 2026-05-25

Researchers report that Test-Time Training (TTT), a technique that lets AI models adapt their parameters during inference, introduces exploitable security flaws. In a study posted on arXiv, they outline three threat models in which attackers can bypass safety filters. Under LoRA, the few-shot and generation-phase threat models achieved average Attack Success Rates (ASR@10) of 95% and 93%, respectively, across various model families and scales. The vulnerabilities also transfer to production fine-tuning APIs. The paper further warns that TTT-induced overfitting can produce degenerate outputs that inflate ASR under standard judges.

Key facts

Test-Time Training (TTT) enables models to adapt parameters during inference.
Three threat models for TTT are identified.
Attackers can exploit TTT to bypass safety filters.
Under LoRA, the few-shot threat model achieves an average ASR@10 of 95%.
Under LoRA, the generation-phase threat model achieves an average ASR@10 of 93%.
Vulnerabilities transfer to production fine-tuning APIs.
TTT-induced overfitting can produce degenerate outputs that inflate ASR.
Study published on arXiv with ID 2605.22984.
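
The ASR@10 figures and the warning about degenerate outputs can be made concrete with a small sketch. It assumes ASR@k counts a prompt as broken when at least one of its first k attempts is judged successful, a common convention that this summary does not spell out, so the paper's exact definition may differ. The function names and the repetition heuristic in is_degenerate are hypothetical illustrations, not the authors' method.

```python
from collections import Counter
from typing import Sequence


def is_degenerate(text: str, max_repeat_ratio: float = 0.5, min_words: int = 5) -> bool:
    """Hypothetical heuristic: flag outputs that are very short or dominated by one word."""
    words = text.split()
    if len(words) < min_words:
        return True
    most_common_count = Counter(words).most_common(1)[0][1]
    return most_common_count / len(words) > max_repeat_ratio


def asr_at_k(judged: Sequence[Sequence[bool]], k: int = 10) -> float:
    """Fraction of prompts with at least one successful attempt among the first k.

    judged[i][j] is the judge's verdict for attempt j on prompt i.
    """
    if not judged:
        raise ValueError("need at least one prompt")
    hits = sum(1 for attempts in judged if any(attempts[:k]))
    return hits / len(judged)


def filtered_asr_at_k(
    outputs: Sequence[Sequence[str]],
    verdicts: Sequence[Sequence[bool]],
    k: int = 10,
) -> float:
    """ASR@k where a success only counts if the output is not flagged as degenerate."""
    judged = [
        [verdict and not is_degenerate(text) for text, verdict in zip(texts, flags)]
        for texts, flags in zip(outputs, verdicts)
    ]
    return asr_at_k(judged, k)
```

Comparing asr_at_k on the raw judge verdicts with filtered_asr_at_k on the same data gives a rough sense of how much of a reported success rate might come from repetitive, overfit outputs rather than genuinely harmful completions.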
