SciEval: Benchmark for Automated K-12 Science Material Evaluation
A team of researchers has released SciEval, the first benchmark dataset for Automatic Instructional Materials Evaluation (AIME), which assesses AI-generated K-12 science instructional materials. The dataset consists of instructional materials annotated by experts with evaluation scores aligned to pedagogical standards and supported by evidence-based rationales. The study, published on arXiv, formulates AIME as a generative AI task: given rubrics designed by educators, a model must predict both scores and supporting evidence. The authors develop baseline models to gauge how well large language models (LLMs) perform on this task, since their reliability for evaluating instructional materials remains uncertain. As generative AI sees wider use in education, automated evaluation becomes increasingly necessary, because manual review is labor- and expertise-intensive.
Key facts
- SciEval is the first dataset for Automatic Instructional Materials Evaluation (AIME).
- Dataset includes instructional materials with pedagogy-aligned scores and evidence-based rationales.
- AIME is formulated as a generative AI task predicting scores and evidence using educator-designed rubrics.
- Baseline models are developed for AIME.
- LLMs' performance on instructional materials evaluation is unclear.
- Manual review of instructional materials is time-consuming and expertise-intensive.
- The work is published on arXiv with ID 2604.25472v1.
- More educators are using generative AI to create instructional materials.
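The task formulation above can be sketched in code. The schema and prompt wording below are hypothetical illustrations, not the actual SciEval annotation format or the authors' baseline: they simply show how a generative AIME input might pair an instructional material with educator-designed rubric criteria so a model can return a score and evidence per criterion.

```python
from dataclasses import dataclass

@dataclass
class RubricCriterion:
    # One educator-designed rubric criterion (hypothetical schema,
    # not the published SciEval format).
    name: str
    description: str
    max_score: int

@dataclass
class Evaluation:
    # One AIME prediction: a score plus an evidence quote
    # drawn from the instructional material.
    criterion: str
    score: int
    evidence: str

def build_aime_prompt(material: str, rubric: list[RubricCriterion]) -> str:
    """Assemble a generative-evaluation prompt asking a model to
    output a score and a supporting evidence quote per criterion."""
    lines = [
        "Evaluate the instructional material against each rubric criterion.",
        "For each criterion, return a score and a supporting evidence quote.",
        "",
        "Material:",
        material,
        "",
        "Rubric:",
    ]
    for c in rubric:
        lines.append(f"- {c.name} (0-{c.max_score}): {c.description}")
    return "\n".join(lines)
```

In this framing, a baseline LLM would receive the assembled prompt and emit one `Evaluation` per criterion; comparing predicted scores and evidence against the expert annotations is what the benchmark measures.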