Automated Discourse Analysis System for Science Classrooms

other · 2026-04-25

A new automated discourse analysis system (ADAS) jointly classifies teacher and student utterances along two dimensions, Utterance Type (UT) and Reasoning Component (RC), following the CDAT framework. To handle severe label imbalance, the system combines stratified resplitting of the annotated corpus, LLM-based synthetic data augmentation for minority classes, and a RoBERTa-base classifier with a dual-probe head. A zero-shot GPT-5.4 baseline reached macro-F1 scores of 0.467 on UT and 0.476 on RC, which the work presents as upper bounds for prompt-only approaches and as motivation for fine-tuning. The system aims to reduce the labor-intensive manual coding of classroom discourse so that analysis can be done at scale, supporting deeper understanding of knowledge construction and improved instructional practice.
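Macro-F1 is the unweighted mean of per-class F1 scores, so a rare class counts as much as a frequent one; this is why it is a common choice under label imbalance. The following is a minimal sketch of the metric only, not the evaluation code behind the reported numbers. The labels "A" and "B" are hypothetical placeholders and are not CDAT categories.

```python
from collections import Counter

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted p, but the true label was something else
            fn[t] += 1  # true label t was missed
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical imbalanced example: a classifier that always predicts the majority class.
y_true = ["A"] * 8 + ["B"] * 2
y_pred = ["A"] * 10
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
```

In this toy case accuracy is 0.8 while macro-F1 is 4/9 (about 0.44), because the minority class "B" is never predicted and contributes an F1 of 0.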

Key facts

ADAS jointly classifies teacher and student utterances
Uses two dimensions: Utterance Type (UT) and Reasoning Component (RC)
Based on the CDAT framework
Addresses severe label imbalance
Applies stratified resplitting of the annotated corpus
Uses LLM-based synthetic data augmentation for minority classes
Trains a RoBERTa-base classifier with a dual-probe head
Zero-shot GPT-5.4 baseline achieved macro-F1 0.467 on UT and 0.476 on RC
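
Stratified resplitting can be sketched as follows. The summary does not say how the strata are defined, so this example assumes stratification on the joint (UT, RC) label pair. The function name `stratified_split`, the field names `ut` and `rc`, and the label values are all hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(examples, key, test_fraction=0.2, seed=0):
    """Split examples so each stratum (as returned by key) appears in both parts."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for ex in examples:
        groups[key(ex)].append(ex)
    train, test = [], []
    for label in sorted(groups):
        items = groups[label]
        rng.shuffle(items)
        if len(items) >= 2:
            # Keep at least one item on each side so rare strata are not lost
            n_test = min(max(round(len(items) * test_fraction), 1), len(items) - 1)
        else:
            n_test = 0  # a singleton stratum stays in the training set
        test.extend(items[:n_test])
        train.extend(items[n_test:])
    return train, test

# Hypothetical toy corpus with one frequent and two rarer joint labels.
data = (
    [{"text": f"u{i}", "ut": "question", "rc": "claim"} for i in range(10)]
    + [{"text": f"v{i}", "ut": "statement", "rc": "evidence"} for i in range(5)]
    + [{"text": f"w{i}", "ut": "statement", "rc": "reasoning"} for i in range(2)]
)
joint = lambda ex: (ex["ut"], ex["rc"])
train, test = stratified_split(data, key=joint)
```

With these 17 toy utterances the split yields 13 training and 4 test examples, and every joint label appears on both sides. Stratifying on the joint pair rather than on each dimension separately is a design choice made for this sketch; it guarantees coverage of rare combinations but produces many small strata when the label sets are large.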

Sources

arXiv cs.AI — 2026-04-25