ARTFEED — Contemporary Art Intelligence

AI Model Safety Benchmarks for Biological Weaponization Risks

ai-technology · 2026-04-24

A recent study published on arXiv evaluates four prominent AI models (ChatGPT 5.2 Auto, Gemini 3 Pro Thinking, Claude Opus 4.5, and Meta's Muse Spark Thinking) using 73 open-ended, novice-framed, benign STEM prompts to gauge their operational intelligence and the attendant risk of biological misuse. The research was prompted by warnings from AI experts and safety reports that advances in model reasoning could let low-expertise users misuse the technology. Gemini and Meta scored very high on the benign quantitative tasks, ChatGPT was partially useful but gave thinner answers, and Claude returned the sparsest responses, including some false-positive refusals. A follow-up test set designed to probe detection of subtle harmful intent showed that Gemini lacked contextual awareness on edge-case prompts. The authors therefore carried out a focused weaponization analysis of Gemini across four access environments, since its capabilities appeared to outpace its moderation. The study concludes that the safeguards under development at leading labs remain a work in progress.
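
The summary gives no implementation details for the evaluation harness, but the basic workflow (send each benign prompt to each model, record the answer, and flag false-positive refusals) can be pictured with a minimal Python sketch like the one below. Every name in it, including query_model, the keyword-based refusal check, and refusal_rate, is a hypothetical stand-in rather than anything taken from the study.

    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class PromptResult:
        model: str
        prompt: str
        response: str
        refused: bool

    # Hypothetical stand-in for a real call to each vendor's model endpoint.
    def query_model(model: str, prompt: str) -> str:
        raise NotImplementedError("Wire this to the relevant provider SDK or HTTP API.")

    # Crude refusal detector; a real evaluation would use human or model-based grading.
    def looks_like_refusal(response: str) -> bool:
        markers = ("i can't help", "i cannot assist", "unable to provide")
        return any(m in response.lower() for m in markers)

    def run_benchmark(models: list[str], prompts: list[str],
                      ask: Callable[[str, str], str] = query_model) -> list[PromptResult]:
        """Send every benign STEM prompt to every model and record whether it refused."""
        results = []
        for model in models:
            for prompt in prompts:
                response = ask(model, prompt)
                results.append(PromptResult(model, prompt, response, looks_like_refusal(response)))
        return results

    def refusal_rate(results: list[PromptResult], model: str) -> float:
        """Fraction of one model's answers flagged as refusals on the benign set."""
        mine = [r for r in results if r.model == model]
        return sum(r.refused for r in mine) / len(mine) if mine else 0.0

In this framing, a high refusal rate on the benign prompt set would surface the kind of false-positive behavior the summary attributes to Claude, and a second, intent-laden prompt set could reuse the same loop with a different grader.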

Key facts

  • Benchmarked models: ChatGPT 5.2 Auto, Gemini 3 Pro Thinking, Claude Opus 4.5, Meta's Muse Spark Thinking
  • 73 novice-framed, open-ended benign STEM prompts used
  • Gemini and Meta scored very high on benign quantitative tasks
  • ChatGPT partially useful but gave thinner answers
  • Claude gave the sparsest responses, with some false-positive refusals
  • Second test set probed detection of subtle harmful intent
  • Gemini lacked contextual awareness on edge-case prompts
  • Focused weaponization analysis on Gemini across four access environments

Entities

Institutions

  • arXiv

Sources