SONAR: A Framework for Sanitizing Malicious Instructions in LLMs
A new system named SONAR has been launched to tackle security issues in large language models (LLMs) that depend on external text sources, including retrieval-augmented generation and tool-integrated agents. These models are vulnerable to attacks where adversaries introduce harmful instructions that can lead to unexpected actions. Current protective measures, such as LLM-based detectors and training methods, often fall prey to optimization attacks or struggle with new data distributions. SONAR functions as a prompt sanitization tool, utilizing natural language inference metrics to identify and eliminate injected content. It creates a relational graph at the sentence level from user queries and external data, employing entailment and contradiction scores as weights to find sentences that stray from the main task. Anomalous sentences are pruned through connectivity-driven methods. This framework is elaborated in a paper on arXiv, identifier 2605.01078, classified as a cross-type announcement, aiming to bolster LLM security by emphasizing sentence relational structures over mere detection or training.
Key facts
- SONAR is a prompt sanitization framework for LLMs.
- It addresses security vulnerabilities from external textual sources.
- Uses natural language inference metrics to detect malicious instructions.
- Constructs a sentence-level relational graph with entailment and contradiction scores.
- Employs connectivity-driven pruning to remove deviant sentences.
- Published on arXiv with ID 2605.01078.
- Targets retrieval-augmented generation and tool-integrated LLM agents.
- Aims to overcome limitations of LLM-based detectors and training-based methods.
Entities
Institutions
- arXiv