GradingAttack Framework Exposes LLM Grading Vulnerabilities
A new framework called GradingAttack has been developed by researchers to identify security flaws in educational grading agents that utilize LLM technology. This system employs both token-level and prompt-level attack techniques to alter grading results discreetly. Tests conducted across various datasets indicate that both methods successfully undermine grading agents, with prompt-level attacks yielding superior success rates. This research underscores significant issues regarding the reliability of AI-driven grading systems used in practical settings.
Key facts
- GradingAttack is a fine-grained adversarial attack framework for LLM grading agents.
- It uses token-level and prompt-level attack strategies.
- Prompt-level attacks achieve higher success rates.
- Experiments conducted on multiple datasets.
- The framework exposes fundamental weaknesses in current agent deployments.
- LLMs are increasingly used for automatic short answer grading (ASAG).
- The study focuses on security vulnerabilities of grading agents in the wild.
- The paper is available on arXiv with ID 2602.00979.
Entities
Institutions
- arXiv