SciEval: Benchmark for Automated K-12 Science Material Evaluation
A team of researchers has released SciEval, the first benchmark dataset for Automatic Instructional Materials Evaluation (AIME), which assesses AI-generated K-12 science instructional materials. The dataset consists of instructional materials annotated by experts with evaluation scores aligned to pedagogical standards and supported by evidence-based rationales. The study, published on arXiv, formulates AIME as a generative AI task: given rubrics designed by educators, a model must predict both scores and supporting evidence. The authors develop baseline models to gauge how well large language models (LLMs) perform on this task, since their reliability for evaluating instructional materials remains uncertain. As generative AI sees wider use in education, automated evaluation becomes increasingly necessary, because manual review is labor- and expertise-intensive.
Key facts
- SciEval is the first dataset for Automatic Instructional Materials Evaluation (AIME).
- Dataset includes instructional materials with pedagogy-aligned scores and evidence-based rationales.
- AIME is formulated as a generative AI task predicting scores and evidence using educator-designed rubrics.
- Baseline models are developed for AIME.
- LLMs' performance on instructional materials evaluation is unclear.
- Manual review of instructional materials is time-consuming and expertise-intensive.
- The work is published on arXiv with ID 2604.25472v1.
- More educators are using generative AI to create instructional materials.
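The task formulation above can be sketched in code. The schema and prompt wording below are hypothetical illustrations, not the actual SciEval annotation format or the authors' baseline: they simply show how a generative AIME input might pair an instructional material with educator-designed rubric criteria so a model can return a score and evidence per criterion.

```python
from dataclasses import dataclass

@dataclass
class RubricCriterion:
    # One educator-designed rubric criterion (hypothetical schema,
    # not the published SciEval format).
    name: str
    description: str
    max_score: int

@dataclass
class Evaluation:
    # One AIME prediction: a score plus an evidence quote
    # drawn from the instructional material.
    criterion: str
    score: int
    evidence: str

def build_aime_prompt(material: str, rubric: list[RubricCriterion]) -> str:
    """Assemble a generative-evaluation prompt asking a model to
    output a score and a supporting evidence quote per criterion."""
    lines = [
        "Evaluate the instructional material against each rubric criterion.",
        "For each criterion, return a score and a supporting evidence quote.",
        "",
        "Material:",
        material,
        "",
        "Rubric:",
    ]
    for c in rubric:
        lines.append(f"- {c.name} (0-{c.max_score}): {c.description}")
    return "\n".join(lines)
```

In this framing, a baseline LLM would receive the assembled prompt and emit one `Evaluation` per criterion; comparing predicted scores and evidence against the expert annotations is what the benchmark measures.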