A multimodal deep learning approach for automated assessment of depressive symptoms in children
Summary
Depression is a global health issue affecting individuals of all age groups, and its
prevalence in children is rising. Early detection of depression at a young age is crucial,
yet several obstacles hinder the accurate assessment of depressive symptoms in young
populations. This study proposes a multimodal deep learning framework for the automated
assessment of depressive symptoms in children. We extracted audio and text features from
videos of parent-child interactions to train and evaluate deep learning models. Our
methodology used pretrained feature extractors: Wav2Vec2.0 and CLAP for audio
features, and RobBERT and SBERT for text features. In addition
to examining each modality independently, we explored how multimodal fusion could
improve the detection of depressive symptoms in children. Our results indicate
that multimodal deep representations can effectively identify depressive symptoms in
children, particularly in contexts involving cooperative tasks. The combination of SBERT and
CLAP feature representations yielded an AUC of 0.810 in the cooperative scenario.
This result serves as a strong foundation for exploring the complex process of assessing
depressive symptoms as more data becomes available.
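
To make the fusion and evaluation steps described above concrete, the following is a minimal sketch, not the study's implementation. It assumes feature-level fusion by concatenating one audio embedding and one text embedding per recording; the summary does not state which fusion strategy was used. The function names (l2_normalize, fuse, roc_auc) and the toy vectors are hypothetical, and the short lists stand in for real CLAP and SBERT embeddings. The AUC is computed as the fraction of positive-negative pairs in which the positive example receives the higher score.

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so neither modality dominates by magnitude."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse(audio_vec: Sequence[float], text_vec: Sequence[float]) -> List[float]:
    """Feature-level fusion (an assumption): normalize each modality, then concatenate."""
    return l2_normalize(audio_vec) + l2_normalize(text_vec)


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy vectors standing in for a CLAP (audio) and an SBERT (text) embedding.
fused = fuse([3.0, 4.0], [0.0, 2.0])
print(len(fused))  # 4: two audio dimensions followed by two text dimensions

# Toy classifier scores for four recordings (1 = depressive symptoms present).
print(roc_auc([1, 0, 1, 0], [0.9, 0.2, 0.4, 0.5]))  # 0.75
```

In practice the fused vector would be passed to a classifier, and the AUC would be computed on held-out recordings rather than on hand-picked scores.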