EduAgentBench: A Multi-Stage Benchmark for AI Tutor Agents
EduAgentBench is a source-grounded benchmark that evaluates language agents against authentic teaching processes. It comprises 150 quality-controlled tasks spanning three capability surfaces: professional pedagogical judgment, situated multi-turn tutoring, and Canvas-style teaching workflow completion. The benchmark measures a tutor agent's ability to diagnose learner states, adapt support over time, make pedagogically justified decisions, and execute interventions in realistic learning management systems. Tasks are constructed through a pedagogical-insight-driven pipeline and evaluated with complementary verification methods.
Key facts
- EduAgentBench is a source-grounded benchmark for evaluating tutor agents.
- It contains 150 quality-controlled tasks.
- Tasks cover three capability surfaces: professional pedagogical judgment, situated multi-turn tutoring, and Canvas-style teaching workflow completion.
- The benchmark assesses diagnosis of learner state, adaptation of support, pedagogically justified decisions, and execution of interventions.
- Tasks are constructed through a pedagogical-insight-driven pipeline.
- Evaluation combines complementary verification methods.
- The benchmark addresses the gap in measuring tutoring capabilities of language agents.
- Effective tutor agents require more than correct answers or accurate tool calls.
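The three-way split of tasks across capability surfaces can be sketched as a simple data model. This is a minimal illustration, not the benchmark's actual schema: the class name `TutorTask`, the field names, and the surface identifiers are all assumptions introduced here for clarity.

```python
from dataclasses import dataclass, field

# Hypothetical identifiers for the three capability surfaces described above
# (these names are assumptions, not taken from the benchmark itself).
CAPABILITY_SURFACES = (
    "pedagogical_judgment",
    "multi_turn_tutoring",
    "teaching_workflow",
)


@dataclass
class TutorTask:
    """Illustrative record for a single benchmark task (schema is hypothetical)."""
    task_id: str
    surface: str            # one of CAPABILITY_SURFACES
    prompt: str             # the task presented to the tutor agent
    rubric: list = field(default_factory=list)  # criteria used in verification

    def __post_init__(self):
        # Reject tasks outside the three capability surfaces.
        if self.surface not in CAPABILITY_SURFACES:
            raise ValueError(f"unknown capability surface: {self.surface}")


def split_by_surface(tasks):
    """Group tasks by capability surface, mirroring the three-way task split."""
    groups = {s: [] for s in CAPABILITY_SURFACES}
    for task in tasks:
        groups[task.surface].append(task)
    return groups


tasks = [
    TutorTask("pj-001", "pedagogical_judgment",
              "Select the most appropriate hint for this learner error.",
              rubric=["pedagogical soundness"]),
    TutorTask("mt-001", "multi_turn_tutoring",
              "Continue this tutoring dialogue, adapting to the learner's state.",
              rubric=["adaptivity"]),
]
groups = split_by_surface(tasks)
```

A grouping like this would let an evaluation harness report per-surface scores rather than a single aggregate, which matches the benchmark's framing of distinct capability surfaces.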
Entities
Institutions
- arXiv