Education Data Science

Late-Stage Audio and Text Fusion: A Multimodal Approach to Detecting Student Engagement in Noisy Classroom Recordings

Project Year
2025
Abstract

We present a late-stage audio and text fusion model for detecting student engagement from classroom audio recordings and transcripts. Using a dataset of annotated K-8 math classroom sessions, we compare four models: logistic regression on acoustic features, fine-tuned wav2vec 2.0, a keyword-based text classifier, and the proposed late-stage audio + text fusion model. The fusion model achieves an F1-score of 0.78, substantially outperforming the best unimodal model (F1 = 0.69). Our results show that combining audio and text signals captures complementary indicators of student engagement in noisy classrooms, where student talk is difficult to transcribe and diarize accurately with automatic speech recognition (ASR) systems. This work highlights the promise of scalable multimodal approaches for detecting classroom engagement in real time.
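The abstract does not specify the fusion mechanism, but a common late-stage fusion scheme combines the output probabilities of independently trained unimodal classifiers, for example by weighted averaging. A minimal sketch under that assumption (the probabilities and weight below are hypothetical):

```python
import numpy as np

def late_fusion(p_audio, p_text, w_audio=0.5):
    """Fuse per-utterance engagement probabilities from independent
    audio and text classifiers via a weighted average."""
    p_audio = np.asarray(p_audio, dtype=float)
    p_text = np.asarray(p_text, dtype=float)
    return w_audio * p_audio + (1.0 - w_audio) * p_text

# Hypothetical per-utterance engagement probabilities
p_audio = [0.80, 0.30, 0.55]  # e.g. from an acoustic classifier
p_text = [0.60, 0.20, 0.70]   # e.g. from a text classifier on ASR output
fused = late_fusion(p_audio, p_text, w_audio=0.6)
labels = (fused >= 0.5).astype(int)  # binary engaged / not-engaged calls
```

Because fusion happens after each classifier produces a score, the audio model can still contribute when the transcript for an utterance is unreliable, which matters in noisy classroom recordings.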

EDS Students

Samin Khan
Class: 2025
Areas of interest: AI in higher ed, AI research & development, product management, positive psychology, social networks