ARTFEED — Contemporary Art Intelligence

AI Scoring of Scientific Explanations: Data Augmentation for Class Imbalance

ai-technology · 2026-04-24

A study investigates data augmentation strategies for addressing class imbalance in transformer-based automated scoring of students' scientific explanations. The dataset comprises 1,466 high school responses to a physical science assessment aligned with an NGSS learning progression, each scored against a rubric of 11 binary-coded analytic categories: six complete explanation components and five common incomplete or inaccurate ideas. The baseline model is a fine-tuned SciBERT, tested with three augmentation methods: GPT-4-generated synthetic responses, EASE (a word-level extraction and filtering approach), and ALP (Augmentation using Lexicalized Probabilistic context-free grammars). The research aims to improve scoring accuracy for advanced reasoning, which is often underrepresented among the rubric categories.
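To make the class-imbalance problem concrete, here is a minimal sketch (not from the study; category names and toy data are hypothetical) of how per-category positive rates over binary-coded labels expose which rubric categories are underrepresented:

```python
from collections import Counter

# Hypothetical category names: 6 complete components (C1..C6) and
# 5 incomplete/inaccurate ideas (I1..I5), matching the 11-category rubric.
CATEGORIES = [f"C{i}" for i in range(1, 7)] + [f"I{i}" for i in range(1, 6)]

def positive_rates(labels):
    """Fraction of responses coded 1 per category; low rates flag
    the imbalanced (underrepresented) categories."""
    n = len(labels)
    counts = Counter()
    for row in labels:
        for cat, value in zip(CATEGORIES, row):
            counts[cat] += value
    return {cat: counts[cat] / n for cat in CATEGORIES}

# Toy label matrix: 4 responses x 11 binary codes. C6 (imagine it marks
# advanced reasoning) appears in only one response.
toy = [
    [1, 1, 0, 0, 0, 0,  1, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0,  0, 1, 0, 0, 0],
    [0, 1, 1, 1, 0, 0,  0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1, 1,  0, 0, 0, 0, 0],
]
rates = positive_rates(toy)
rare = [c for c, r in rates.items() if r < 0.5]  # candidates for augmentation
```

Categories in `rare` are the ones an augmentation strategy would target with extra positive examples.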

Key facts

  • Dataset consists of 1,466 high school responses
  • Responses scored on 11 binary-coded analytic categories
  • Rubric identifies six complete explanation components and five common incomplete or inaccurate ideas
  • Baseline model is SciBERT
  • Three augmentation strategies tested: GPT-4 synthetic responses, EASE, and ALP
  • Study addresses class imbalance in automated scoring of scientific explanations
  • Assessment is based on an NGSS-aligned learning progression
  • Research aims to improve transformer-based text classification for student responses
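As a rough illustration of the general approach (a hedged sketch, not the study's actual pipeline), augmentation for class imbalance typically appends synthetic positives for a rare category to the training split only, leaving evaluation on real responses:

```python
import random

def augment_minority(train, synthetic_pool, category_index, target_positives, seed=0):
    """Append synthetic (text, labels) examples coded 1 on a rare category
    until the training set holds `target_positives` positives for it.
    The test split is never augmented. All names here are hypothetical."""
    rng = random.Random(seed)
    augmented = list(train)
    positives = sum(labels[category_index] for _, labels in augmented)
    candidates = [ex for ex in synthetic_pool if ex[1][category_index] == 1]
    rng.shuffle(candidates)  # avoid always taking the same synthetic items
    for example in candidates:
        if positives >= target_positives:
            break
        augmented.append(example)
        positives += 1
    return augmented

# Toy example with 2 categories; category index 1 has no real positives.
train = [("response a", [1, 0]), ("response b", [0, 0])]
synthetic = [("gpt4 sample 1", [0, 1]), ("gpt4 sample 2", [1, 1])]
out = augment_minority(train, synthetic, category_index=1, target_positives=2)
```

Whether synthetic items come from GPT-4 generation or from EASE- or ALP-style transformations, the merge step looks the same; the methods differ in how the synthetic pool is produced.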
