LLM Tutoring Agents Fail at Distinguishing Suboptimal from Incorrect Solutions

ai-technology · 2026-05-18

A recent study on arXiv (2605.16207) evaluates seven feedback agents built on large language models (LLMs) for propositional logic tutoring, using knowledge-graph-derived ground truth across 10,836 solution-feedback pairs. The models were near ceiling on optimal steps, but they consistently over-rejected valid yet suboptimal reasoning and over-validated incorrect solutions, precisely the cases where adaptive tutoring matters most. These failures persisted across models regardless of solution context, which suggests architectural rather than informational limits. In addition, accurate diagnosis did not reliably produce pedagogically actionable feedback.

Key facts

Study evaluates seven LLM feedback agents in propositional logic tutoring
Uses knowledge-graph-derived ground truth across 10,836 solution-feedback pairs
Models near-ceiling on optimal steps but over-reject valid suboptimal reasoning
Models over-validate incorrect solutions
Failures persist across models regardless of solution context
Suggests architectural rather than informational limits
Accurate diagnosis does not reliably produce pedagogically actionable feedback
Published on arXiv with ID 2605.16207
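
To make the two failure modes concrete, here is a minimal sketch of how over-rejection and over-validation rates could be tallied against ground-truth labels. It is not the study's evaluation code: the category names, the `error_rates` helper, and the toy data are all assumptions made for illustration, and the numbers are invented rather than taken from the paper.

```python
from collections import Counter

# Hypothetical ground-truth categories for a student step (names are assumptions).
OPTIMAL = "optimal"
SUBOPTIMAL = "valid_suboptimal"
INCORRECT = "incorrect"


def error_rates(pairs):
    """Return the per-category rate at which an agent's verdict is wrong.

    pairs: iterable of (ground_truth_category, agent_accepted) tuples,
    where agent_accepted is True if the agent judged the step valid.
    """
    totals = Counter()
    errors = Counter()
    for truth, accepted in pairs:
        totals[truth] += 1
        # Optimal and suboptimal steps are both valid and should be accepted;
        # only incorrect steps should be rejected.
        should_accept = truth != INCORRECT
        if accepted != should_accept:
            errors[truth] += 1
    return {category: errors[category] / totals[category] for category in totals}


# Invented toy data, NOT results from the study.
toy_pairs = (
    [(OPTIMAL, True)] * 4                                   # all optimal steps accepted
    + [(SUBOPTIMAL, True)] * 2 + [(SUBOPTIMAL, False)] * 2  # half wrongly rejected
    + [(INCORRECT, False)] * 3 + [(INCORRECT, True)] * 1    # one wrongly accepted
)

rates = error_rates(toy_pairs)
print("over-rejection of valid suboptimal steps:", rates[SUBOPTIMAL])
print("over-validation of incorrect steps:", rates[INCORRECT])
```

Under this framing, an agent can score perfectly on optimal steps while still showing a high error rate in the other two categories, which is the pattern the summary describes.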
