Leveraging multimodal deep learning to predict anxiety symptoms from parent–child interactions
Summary
Early, unobtrusive detection of anxiety in children can facilitate timelier
and more equitable access to mental-health interventions. This thesis developed
and evaluated a deep-learning pipeline that predicts anxiety-related behaviours
from recordings of parent–child interactions collected in the YOUth
Cohort Study. The dataset comprised 100 dyads of nine-year-old children
filmed in conflict and cooperative tasks. After automatic diarisation, four
synchronised child-centred streams were extracted: facial expressions, body
posture, speech acoustics, and transcribed language.
Strong unimodal baselines were first established with VideoMAE and FMAEIAT
for vision, a CNN-to-LSTM pipeline for audio, and RobBERT for text,
with the linguistic channel achieving the best single-modality score (F1 =
0.62). Building on these results, a Pairwise Cross-Modal Attention Network
was introduced to learn explicit interactions between modalities. This architecture
raised overall performance to F1 = 0.64, outperformed classic fusion
techniques, and remained resilient when individual streams were noisy or
absent. Ablation analyses showed that body-pose embeddings acted as
a pivotal hub queried by the other modalities, and that filtering out parental
speech was crucial for stable audio contributions.
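
To make the pairwise design concrete, the sketch below shows one way such an architecture can be wired up in PyTorch: every modality attends to every other modality, and the attended representations are pooled and concatenated before classification. The module names, embedding dimensions, pooling choice, and class count are illustrative assumptions, not the thesis implementation.

```python
# Minimal sketch of pairwise cross-modal attention (hypothetical names and
# dimensions): each modality queries every other modality with standard
# multi-head attention; attended views are mean-pooled and concatenated.
import torch
import torch.nn as nn

class PairwiseCrossModalAttention(nn.Module):
    def __init__(self, dim=256, heads=4, num_modalities=4, num_classes=2):
        super().__init__()
        self.num_modalities = num_modalities
        # One attention block per ordered (query modality, key/value modality) pair.
        self.attn = nn.ModuleDict({
            f"{q}->{k}": nn.MultiheadAttention(dim, heads, batch_first=True)
            for q in range(num_modalities)
            for k in range(num_modalities) if q != k
        })
        fused_dim = dim * num_modalities * (num_modalities - 1)
        self.classifier = nn.Linear(fused_dim, num_classes)

    def forward(self, streams):
        # streams: list of per-modality token sequences, each (batch, seq_len, dim)
        pooled = []
        for q, query in enumerate(streams):
            for k, context in enumerate(streams):
                if q == k:
                    continue
                attended, _ = self.attn[f"{q}->{k}"](query, context, context)
                pooled.append(attended.mean(dim=1))  # mean-pool over time
        return self.classifier(torch.cat(pooled, dim=-1))

# Toy usage with four synchronised streams (e.g. face, pose, audio, text embeddings).
model = PairwiseCrossModalAttention()
streams = [torch.randn(2, 16, 256) for _ in range(4)]
logits = model(streams)  # shape: (2, num_classes)
```

Because each modality serves as both query and context in this layout, the network can weight, for example, body-pose tokens more heavily when the linguistic stream is uninformative, which is consistent with the hub role observed for pose in the ablations.
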
Beyond delivering a systematic benchmark for anxiety detection on the YOUth
material, the findings reaffirm the diagnostic value of language yet demonstrate
that carefully designed cross-modal attention can uncover complementary
visual and acoustic cues that text alone misses. Although the current
performance is not yet clinically sufficient, the research charts a clear path
toward scalable, context-aware pre-screening tools for childhood anxiety and
lays the groundwork for future extensions, including the use of VLLMs, alternative
audio backbones, and dyadic modelling of parental behaviour.