AVISE Framework for AI Security Evaluation
A novel open-source framework named AVISE (AI Vulnerability Identification and Security Evaluation) has been launched to methodically pinpoint vulnerabilities and assess the security of AI systems and models. This framework enhances the multi-turn Red Queen attack, based on theory-of-mind, by incorporating an Adversarial Language Model (ALM) augmented attack. Additionally, it features an automated Security Evaluation Test (SET) designed to uncover jailbreak vulnerabilities in language models. The SET consists of 25 test cases along with an Evaluation Language Model (ELM) that evaluates whether each case successfully jailbroke the target model, achieving a remarkable 92% accuracy and an F1-score. This research is available on arXiv with the identifier 2604.20833.
Key facts
- AVISE is a modular open-source framework for AI security evaluation.
- The framework extends the Red Queen attack into an ALM-augmented attack.
- The SET includes 25 test cases for jailbreak vulnerability discovery.
- The ELM achieves 92% accuracy in detecting jailbreak attempts.
- The research is published on arXiv with ID 2604.20833.
- The framework targets vulnerabilities in AI systems and models.
- The attack is based on theory-of-mind multi-turn interactions.
- The SET is automated for evaluating language model security.
Entities
Institutions
- arXiv