Automated Discourse Analysis System for Science Classrooms
A new automated discourse analysis system (ADAS) jointly classifies teacher and student utterances along two dimensions, Utterance Type (UT) and Reasoning Component (RC), following the CDAT framework. To handle severe label imbalance, the system applies stratified resplitting of the annotated corpus, LLM-based synthetic data augmentation for minority classes, and a RoBERTa-base classifier with a dual-probe head. A zero-shot GPT-5.4 baseline achieved macro-F1 scores of 0.467 on UT and 0.476 on RC, establishing upper bounds for prompt-only approaches and motivating fine-tuning. The system aims to replace labor-intensive manual coding of classroom discourse so that analysis becomes feasible at scale, supporting a deeper understanding of knowledge construction and improved instructional practice.
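The stratified resplitting step can be illustrated with a minimal sketch. The summary does not describe the actual procedure, so several things here are assumptions: the function name `stratified_split`, stratifying on the joint (UT, RC) label pair, the 20% held-out fraction, and the placeholder label names (`UT_A`, `RC_X`, and so on), which are not CDAT categories.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split example indices so each label keeps roughly the same share in both parts.

    `labels` holds one hashable label per example; here each label is a
    (UT, RC) pair, which is an assumption and not taken from the source.
    Returns (train_indices, test_indices), both sorted.
    """
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        if len(indices) > 1:
            # Rare classes get at least one held-out example,
            # and at least one example always stays in training.
            n_test = round(len(indices) * test_fraction)
            n_test = min(max(n_test, 1), len(indices) - 1)
        else:
            # A singleton class cannot be split; keep it in training.
            n_test = 0
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical imbalanced corpus: 8 majority examples, 2 minority examples.
labels = [("UT_A", "RC_X")] * 8 + [("UT_B", "RC_Y")] * 2
train_idx, test_idx = stratified_split(labels, test_fraction=0.2, seed=0)
```

With a purely random split, a class with only two examples could easily land entirely in one partition. Splitting per label guarantees the minority class appears on both sides, which is what makes per-class evaluation possible.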
Key facts
- ADAS jointly classifies teacher and student utterances
- Uses two dimensions: Utterance Type (UT) and Reasoning Component (RC)
- Based on the CDAT framework
- Addresses severe label imbalance
- Applies stratified resplitting of the annotated corpus
- Uses LLM-based synthetic data augmentation for minority classes
- Trains a dual-probe head RoBERTa-base classifier
- Zero-shot GPT-5.4 baseline achieved macro-F1 0.467 on UT and 0.476 on RC
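Macro-F1 is the metric reported for the baseline. The sketch below shows how it is commonly computed and why it suits imbalanced labels. It is an illustration under assumptions, not the authors' evaluation code: the function name `macro_f1` and the toy labels `a` and `b` are hypothetical.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = []
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); a class with no hits scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example: a classifier that always predicts the majority class.
y_true = ["a"] * 9 + ["b"]
y_pred = ["a"] * 10
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
```

In this toy case accuracy is 0.9, yet macro-F1 is 9/19 (about 0.474), because the missed minority class contributes an F1 of 0 with the same weight as the majority class. This is why macro-F1 is a stricter measure than accuracy when labels are severely imbalanced.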