ARTFEED — Contemporary Art Intelligence

Test-Time Training Creates New AI Jailbreak Vulnerabilities

ai-technology · 2026-05-25

Researchers have found that Test-Time Training (TTT), a technique that lets AI models adapt their parameters during inference, introduces exploitable security flaws. In a study posted to arXiv, they outline three threat models in which attackers can bypass safety filters. Under LoRA, the few-shot and generation-phase threat models achieved average Attack Success Rates (ASR@10) of 95% and 93%, respectively, across various model families and scales. These vulnerabilities also transfer to production fine-tuning APIs. The paper further warns that TTT-induced overfitting can produce degenerate outputs that inflate ASR under standard judges.

Key facts

  • Test-Time Training (TTT) enables models to adapt parameters during inference.
  • Three threat models for TTT are identified.
  • Attackers can exploit TTT to bypass safety filters.
  • Under LoRA, the few-shot threat model achieves an average ASR@10 of 95%.
  • Under LoRA, the generation-phase threat model achieves an average ASR@10 of 93%.
  • Vulnerabilities transfer to production fine-tuning APIs.
  • TTT-induced overfitting can produce degenerate outputs that inflate ASR.
  • Study published on arXiv with ID 2605.22984.
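
To make the ASR@10 figures and the inflation caveat concrete, here is a minimal Python sketch. It rests on assumptions not stated in the article: that ASR@k counts a prompt as broken when at least one of its first k sampled responses is judged successful, and that degenerate outputs can be flagged by a simple repetition check. The function names, the `min_unique_ratio` threshold, and the toy data are hypothetical illustrations, not the paper's method or results.

```python
from typing import List, Sequence, Tuple


def asr_at_k(judgments: Sequence[Sequence[bool]], k: int = 10) -> float:
    """Assumed definition: share of prompts with >= 1 success in the first k attempts."""
    if not judgments:
        raise ValueError("need at least one prompt")
    hits = sum(1 for attempts in judgments if any(attempts[:k]))
    return hits / len(judgments)


def is_degenerate(text: str, min_unique_ratio: float = 0.3) -> bool:
    """Hypothetical heuristic: flag empty outputs or ones dominated by repeated tokens."""
    tokens = text.split()
    if not tokens:
        return True
    return len(set(tokens)) / len(tokens) < min_unique_ratio


def filtered_asr_at_k(
    attempts_per_prompt: Sequence[Sequence[Tuple[str, bool]]], k: int = 10
) -> float:
    """ASR@k where a judged success only counts if the output is not degenerate."""
    cleaned: List[List[bool]] = [
        [ok and not is_degenerate(text) for text, ok in attempts]
        for attempts in attempts_per_prompt
    ]
    return asr_at_k(cleaned, k)


# Toy data: (model output, judge verdict) pairs for three prompts.
toy = [
    [("first do this and then do that", True)],
    [("yes yes yes yes yes yes yes yes", True)],  # judged a success, but repetitive
    [("I cannot help with that request", False)],
]

raw = asr_at_k([[ok for _, ok in attempts] for attempts in toy], k=10)
filtered = filtered_asr_at_k(toy, k=10)
print(f"raw ASR@10 = {raw:.2f}, filtered ASR@10 = {filtered:.2f}")
```

On this toy data the raw score counts the repetitive output as a success while the filtered score does not, which is the kind of gap the overfitting warning describes.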

Entities

Institutions

  • arXiv

Sources