Dr. Shuchin Aeron

Associate Professor of ECE, Tufts University

Dr. Julia Gouvea

Associate Professor of Education, Tufts University

Automatic coding for SOLO taxonomy in lab reports via Contrastive Learning in the Wasserstein space

An important challenge in the learning sciences is the coding of qualitative data for evidence of students’ engagement in scientific practices. Such evidence may appear as novel ideas, expressions of puzzlement, or idiosyncratic lines of reasoning. Typically, identifying such evidence requires time-consuming labor by trained analysts. In this work we explore whether supervised statistical machine learning (ML) methods can aid learning sciences researchers in qualitative coding. We start with a human-coded set of lab reports from a biology course that were scored using an adapted version of the domain-general Structure of Observed Learning Outcomes (SOLO) taxonomy (Biggs and Collis, 1992). The adapted four-level scheme assigns higher scores to lab reports that exhibit desirable features of scientific writing, specifically: more complex claim structures, use of multiple lines of evidence, and appropriately qualified conclusions that address or acknowledge uncertainty. Using this labeled data, we construct a natural language processing (NLP) pipeline that represents words as vectors and models language generation with a state-space model. We demonstrate that this approach can quantitatively capture SOLO levels, achieving a high quadratic weighted kappa (QWK) score against the human codes when trained via a novel contrastive learning set-up. This finding is corroborated by a blind re-coding experiment, in which the reports that the ML algorithm misclassified in the majority of cross-validation steps were re-evaluated by human coders. We found that the ML predictions agreed well with the re-coded scores, indicating that computational NLP tools can approach the reliability of human coding and may assist researchers in automatic coding at scale.
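For readers unfamiliar with the agreement metric named in the abstract, the following is a minimal sketch of how quadratic weighted kappa (QWK) is commonly computed between two sets of ordinal scores. It is not the authors' evaluation code. The function name `quadratic_weighted_kappa`, the integer encoding of the four SOLO levels as 0 to 3, and the example scores are all assumptions made for illustration.

```python
def quadratic_weighted_kappa(human, predicted, num_levels=4):
    """QWK between two equal-length lists of integer scores in 0..num_levels-1.

    Assumption: the four adapted SOLO levels are encoded as 0, 1, 2, 3.
    """
    if len(human) != len(predicted) or len(human) == 0:
        raise ValueError("score lists must be non-empty and of equal length")
    n = len(human)

    # Observed confusion matrix: rows are human codes, columns are predictions.
    observed = [[0] * num_levels for _ in range(num_levels)]
    for h, p in zip(human, predicted):
        observed[h][p] += 1

    # Marginal score counts for each rater.
    hist_human = [sum(row) for row in observed]
    hist_pred = [sum(observed[i][j] for i in range(num_levels))
                 for j in range(num_levels)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(num_levels):
        for j in range(num_levels):
            # Quadratic penalty: disagreements cost more the further apart they are.
            weight = (i - j) ** 2 / (num_levels - 1) ** 2
            weighted_observed += weight * observed[i][j]
            # Expected count if the two raters were statistically independent.
            weighted_expected += weight * hist_human[i] * hist_pred[j] / n

    if weighted_expected == 0.0:
        # Both raters assigned one identical level to every report;
        # this sketch treats that case as perfect agreement by convention.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected


# Hypothetical scores, for illustration only (not data from the study).
human_codes = [0, 1, 2, 3, 2, 1]
model_codes = [0, 1, 2, 2, 2, 1]
print(quadratic_weighted_kappa(human_codes, model_codes))
```

A value of 1 indicates perfect agreement, 0 indicates agreement no better than chance, and negative values indicate systematic disagreement. Because the penalty grows with the square of the distance between levels, confusing adjacent SOLO levels is penalized far less than confusing the lowest level with the highest.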