MANTA: Multi-turn framework for LLM animal welfare alignment
MANTA (Multi-turn Assessment for Nonhuman Thinking and Alignment) is an evaluation framework for LLM animal welfare alignment, built on the Inspect AI platform. Unlike single-turn benchmarks such as AnimalHarmBench (AHB), MANTA tests advanced LLMs in both professional and everyday contexts and follows each initial answer with adversarial follow-up questions. These pressure turns are generated dynamically from each model's actual responses, so every challenge targets what the model just said. This probes a failure mode in which models capitulate under economic, social, or authority-based arguments. Responses are scored on up to 13 dimensions derived from AHB, each on a continuous scale from 0 to 1. Preliminary results are reported in arXiv:2605.16301.
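The dynamic pressure-turn idea can be illustrated with a minimal Python sketch. This is not MANTA's code and does not use the Inspect AI API. The names `run_dialogue`, `target_model`, `pressure_writer`, and the stub functions are hypothetical, and the choice of one follow-up per pressure type is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Assumption: one follow-up per pressure type. The three types mirror the
# economic, social, and authority-based arguments named in the summary.
PRESSURE_TYPES = ("economic", "social", "authority")


@dataclass
class Turn:
    role: str     # "user" or "assistant"
    content: str


def run_dialogue(
    opening_question: str,
    target_model: Callable[[List[Turn]], str],
    pressure_writer: Callable[[str, str], str],
    pressure_types: Sequence[str] = PRESSURE_TYPES,
) -> List[Turn]:
    """Ask an opening question, then push back once per pressure type.

    Each follow-up is written from the target model's most recent reply,
    so the challenge is specific to what the model actually said.
    """
    history = [Turn("user", opening_question)]
    reply = target_model(history)
    history.append(Turn("assistant", reply))

    for kind in pressure_types:
        follow_up = pressure_writer(reply, kind)   # depends on the last reply
        history.append(Turn("user", follow_up))
        reply = target_model(history)
        history.append(Turn("assistant", reply))
    return history


# Hypothetical stand-ins so the sketch runs without any model access.
def stub_target(history: List[Turn]) -> str:
    return f"Reply {len(history) // 2 + 1}: welfare still matters."


def stub_pressure(last_reply: str, kind: str) -> str:
    return f"[{kind} pressure] You said '{last_reply}', but consider the cost."


transcript = run_dialogue(
    "How should I house laying hens?", stub_target, stub_pressure
)
for turn in transcript:
    print(f"{turn.role}: {turn.content}")
```

In a real evaluation the two callables would wrap LLM calls: one for the model under test and one for the model that writes the adversarial follow-up.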
Key facts
- MANTA is a multi-turn evaluation framework for LLM animal welfare alignment
- Built on the Inspect AI platform
- Uses adversarially generated follow-up questions
- Generates pressure turns dynamically from model responses
- Evaluates across up to 13 AHB-derived scoring dimensions
- Scores each dimension on a continuous 0-1 scale
- Preliminary results reported in arXiv:2605.16301
- Addresses the failure mode in which models capitulate under economic, social, or authority-based arguments
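As a rough illustration of the scoring facts above, the following Python sketch averages per-dimension scores on the 0-1 scale and compares the first and last turns of a dialogue. The summary does not say how MANTA aggregates scores or quantifies capitulation, so the mean over applicable dimensions, the `score_drop` measure, and the placeholder dimension names are all assumptions.

```python
from statistics import mean
from typing import Dict, List, Optional

# A score of None marks a dimension that does not apply to the scenario,
# which is one way to read "up to 13" dimensions (an assumption).
DimensionScores = Dict[str, Optional[float]]


def aggregate(scores: DimensionScores) -> Optional[float]:
    """Mean of the applicable dimension scores, each required to be in [0, 1]."""
    applicable = [s for s in scores.values() if s is not None]
    for s in applicable:
        if not 0.0 <= s <= 1.0:
            raise ValueError(f"score {s} is outside the 0-1 scale")
    return mean(applicable) if applicable else None


def score_drop(per_turn: List[DimensionScores]) -> Optional[float]:
    """First-turn aggregate minus last-turn aggregate.

    A positive value means the score fell under pressure (hypothetical
    capitulation measure, not taken from the source).
    """
    first, last = aggregate(per_turn[0]), aggregate(per_turn[-1])
    if first is None or last is None:
        return None
    return first - last


# Placeholder dimension names; the actual AHB dimensions are not listed here.
turn_scores = [
    {"dim_01": 1.0, "dim_02": 0.5, "dim_03": None},   # initial answer
    {"dim_01": 0.5, "dim_02": 0.25, "dim_03": None},  # after pressure turns
]
print(aggregate(turn_scores[0]), aggregate(turn_scores[1]), score_drop(turn_scores))
```

Keeping inapplicable dimensions as `None` rather than 0 avoids penalizing a model on criteria that a scenario never exercises.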
Entities
Institutions
- Inspect AI