ESTBook: A Pedagogical Diagnostic Benchmark for LLMs on English Standardized Tests

other · 2026-05-01

ESTBook is a new benchmark that assesses large language models (LLMs) not only on test accuracy but also on their ability to reason pedagogically. It comprises 10,576 questions spanning 29 task types drawn from five prominent English standardized exams. Unlike conventional datasets, ESTBook annotates its questions with structured reasoning trajectories and distractor rationales that highlight specific cognitive traps. The framework treats test problem-solving as traversal of a cognitive landscape, with the goal of determining whether LLMs can reason faithfully, explain solution strategies, and diagnose human misconceptions. The work is available on arXiv under the identifier 2505.17056.

Key facts

ESTBook is a multimodal benchmark for LLMs on English standardized tests.
It includes 10,576 questions and 29 task types across five major exams.
The benchmark enriches questions with reasoning trajectories and distractor rationales.
The framework models problem-solving as traversal of a cognitive landscape.
It aims to evaluate faithful reasoning, solution strategies, and misconception diagnosis.
The research is published on arXiv with ID 2505.17056.
Current evaluations focus on binary outcome accuracy (right or wrong); ESTBook aims to go beyond that.
The work is an arXiv preprint, not a peer-reviewed journal article.
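
The difference between binary scoring and misconception diagnosis can be illustrated with a small sketch. Everything here is an assumption made for illustration: the `Question` record, its field names, the `diagnose` helper, and the sample item are hypothetical and are not taken from the ESTBook dataset or its actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class Question:
    """Hypothetical record; field names are illustrative, not the ESTBook schema."""
    task_type: str
    stem: str
    options: dict[str, str]
    answer: str
    # Ordered steps leading to the correct answer (a "reasoning trajectory").
    reasoning_steps: list[str] = field(default_factory=list)
    # Maps a wrong option to the pitfall it is designed to expose.
    distractor_rationales: dict[str, str] = field(default_factory=dict)


def diagnose(question: Question, chosen: str) -> dict:
    """Return the binary outcome plus the annotated misconception, if any."""
    if chosen == question.answer:
        return {"correct": True, "misconception": None}
    return {
        "correct": False,
        # None when the chosen distractor carries no annotation.
        "misconception": question.distractor_rationales.get(chosen),
    }


# Made-up sample item, for illustration only.
item = Question(
    task_type="sentence completion",
    stem="She ___ to the library every Saturday.",
    options={"A": "go", "B": "goes", "C": "going", "D": "gone"},
    answer="B",
    reasoning_steps=[
        "The subject 'She' is third-person singular.",
        "'every Saturday' signals a habitual action, so simple present fits.",
        "Simple present with a third-person singular subject takes 'goes'.",
    ],
    distractor_rationales={
        "A": "Ignores third-person singular subject-verb agreement.",
        "C": "Treats a participle as a finite verb.",
    },
)

print(diagnose(item, "A"))
```

A binary-accuracy evaluation would record only the `correct` flag; the extra `misconception` field is what lets a wrong answer be traced back to a specific pitfall.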
