ARTFEED — Contemporary Art Intelligence

KnotBench: A Hard Benchmark for Vision-Language Models Using Knot Diagrams

ai-technology · 2026-05-12

KnotBench is a new benchmark that evaluates vision-language models on reasoning over knot diagrams. Its dataset comprises 858,318 images rendered from 1,951 prime-knot prototypes with crossing numbers from 3 to 19, with labels verified against Regina's canonical knot signature. The benchmark spans 14 tasks in four families: equivalence judgment, move prediction, identification, and cross-modal grounding. An image/symbol split isolates the perception-operation gap. Claude Opus 4.7 and GPT-5 were evaluated with and without reasoning under a matched 64K output-token budget. Of 56 (task, model) combinations, 15 scored at or below the random baseline, and on 8 of 14 tasks the best score stayed under 1.5x random. Neither model produced a single exactly correct string in diagram-to-symbol transcription, and even permissive Regina decoding recovered the intended knot in only 0 to 4 of 100 items. Enabling reasoning mode lifted overall accuracy by just 1.65 percentage points.
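To make the "1.5x random" criterion concrete, the sketch below checks each task's best model score against 1.5 times its random baseline. The task names and scores are illustrative placeholders, not figures from the benchmark.

```python
# Hypothetical (task, random_baseline, best_model_score) triples, as accuracies.
# The numbers are illustrative only; the article reports that on 8 of 14 tasks
# the best score stayed under 1.5x the random baseline.
tasks = [
    ("equivalence-judgment", 0.50, 0.62),  # 0.62 < 1.5 * 0.50 = 0.75 -> below
    ("move-prediction",      0.25, 0.41),  # 0.41 >= 1.5 * 0.25 = 0.375 -> above
]

def below_1p5x_random(random_baseline: float, best_score: float) -> bool:
    """True if the best model score fails to reach 1.5x the random baseline."""
    return best_score < 1.5 * random_baseline

for name, rand, best in tasks:
    flag = "below" if below_1p5x_random(rand, best) else "above"
    print(f"{name}: best={best:.2f}, 1.5x random={1.5 * rand:.3f} -> {flag}")
```

A binary-choice task (random baseline 0.5) would thus need at least 0.75 accuracy to clear the bar, while a four-way task (baseline 0.25) needs only 0.375.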

Key facts

  • KnotBench uses 858,318 images from 1,951 prime-knot prototypes
  • Crossing numbers range from 3 to 19
  • 14 tasks across four families: equivalence judgment, move prediction, identification, cross-modal grounding
  • Claude Opus 4.7 and GPT-5 tested with and without thinking
  • 64K output-token budget matched on both vendors
  • 15 of 56 (task, model) cases at or below random baseline
  • 8 of 14 tasks have best score under 1.5x random
  • No model produces strictly correct diagram-to-symbol transcription
  • Permissive Regina decoding recovers knot in 0 to 4 of 100 items
  • Thinking-mode reasoning lifts accuracy by 1.65 percentage points
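The two transcription metrics in the facts above (strict string match vs. permissive decoding) can be sketched as follows. The `decode_to_knot` callable standing in for Regina's permissive decoder is a hypothetical placeholder; the benchmark's actual decoding procedure is not shown in the article.

```python
from typing import Callable, Optional

def score_transcription(pred: str, gold: str,
                        decode_to_knot: Callable[[str], Optional[str]]) -> dict:
    """Score one diagram-to-symbol transcription two ways:
    - strict: the predicted string must equal the gold string exactly;
    - permissive: the predicted string, decoded to a canonical knot
      identifier (here via a caller-supplied stand-in for Regina's
      permissive decoder), must name the same knot as the gold string.
    """
    strict = pred == gold
    pred_knot = decode_to_knot(pred)
    gold_knot = decode_to_knot(gold)
    permissive = pred_knot is not None and pred_knot == gold_knot
    return {"strict": strict, "permissive": permissive}

# Toy stand-in decoder: tolerates surrounding whitespace and letter case.
def toy_decoder(s: str) -> Optional[str]:
    return s.strip().lower() or None

print(score_transcription("3_1", "3_1", toy_decoder))
print(score_transcription("3_1 ", "3_1", toy_decoder))
```

The second call illustrates why permissive decoding can recover knots that strict matching misses: a trailing space fails the exact-string test but still decodes to the same knot.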

Entities

Institutions

  • arXiv
  • Regina

Sources