ARTFEED — Contemporary Art Intelligence

AI Model Safety Benchmarks for Biological Weaponization Risks

ai-technology · 2026-04-24

A recent study published on arXiv evaluates four prominent AI models (ChatGPT 5.2 Auto, Gemini 3 Pro Thinking, Claude Opus 4.5, and Meta's Muse Spark Thinking) using 73 open-ended, novice-framed, benign STEM prompts to gauge their operational intelligence and the attendant risk of biological misuse. The research was prompted by warnings from AI experts and safety reports that advances in model reasoning could let low-expertise users misuse the technology. Gemini and Meta scored very high on the benign quantitative tasks, ChatGPT was partially useful but gave thinner answers, and Claude returned the sparsest responses, including some false-positive refusals. A follow-up test set designed to probe detection of subtle harmful intent showed that Gemini lacked contextual awareness on edge-case prompts. The authors therefore carried out a focused weaponization analysis of Gemini across four access environments, since its capabilities appeared to outpace its moderation. The study concludes that the safeguards under development at leading labs remain a work in progress.
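
The summary gives no implementation details for the evaluation harness, but the basic workflow (send each benign prompt to each model, record the answer, and flag false-positive refusals) can be pictured with a minimal Python sketch like the one below. Every name in it, including query_model, the keyword-based refusal check, and refusal_rate, is a hypothetical stand-in rather than anything taken from the study.

    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class PromptResult:
        model: str
        prompt: str
        response: str
        refused: bool

    # Hypothetical stand-in for a real call to each vendor's model endpoint.
    def query_model(model: str, prompt: str) -> str:
        raise NotImplementedError("Wire this to the relevant provider SDK or HTTP API.")

    # Crude refusal detector; a real evaluation would use human or model-based grading.
    def looks_like_refusal(response: str) -> bool:
        markers = ("i can't help", "i cannot assist", "unable to provide")
        return any(m in response.lower() for m in markers)

    def run_benchmark(models: list[str], prompts: list[str],
                      ask: Callable[[str, str], str] = query_model) -> list[PromptResult]:
        """Send every benign STEM prompt to every model and record whether it refused."""
        results = []
        for model in models:
            for prompt in prompts:
                response = ask(model, prompt)
                results.append(PromptResult(model, prompt, response, looks_like_refusal(response)))
        return results

    def refusal_rate(results: list[PromptResult], model: str) -> float:
        """Fraction of one model's answers flagged as refusals on the benign set."""
        mine = [r for r in results if r.model == model]
        return sum(r.refused for r in mine) / len(mine) if mine else 0.0

In this framing, a high refusal rate on the benign prompt set would surface the kind of false-positive behavior the summary attributes to Claude, and a second, intent-laden prompt set could reuse the same loop with a different grader.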

Key facts

  • Benchmarked models: ChatGPT 5.2 Auto, Gemini 3 Pro Thinking, Claude Opus 4.5, Meta's Muse Spark Thinking
  • 73 novice-framed, open-ended benign STEM prompts used
  • Gemini and Meta scored very high on benign quantitative tasks
  • ChatGPT partially useful but gave thinner answers
  • Claude gave the sparsest responses, with some false-positive refusals
  • Second test set probed detection of subtle harmful intent
  • Gemini lacked contextual awareness on edge-case prompts
  • Focused weaponization analysis on Gemini across four access environments

Entities

Institutions

  • arXiv

Sources