Utrecht University Student Theses Repository


Leveraging multimodal deep learning to predict anxiety symptoms from parent–child interactions

File: Sotiris Zenios MSc_Thesis.pdf (11.44 MB)
Publication date: 2025
Author: Zenios, Sotiris
        Summary
Early, unobtrusive detection of anxiety in children can facilitate timelier and more equitable access to mental-health interventions. This thesis developed and evaluated a deep-learning pipeline that predicts anxiety-related behaviours from recordings of parent–child interactions collected in the YOUth Cohort Study. The dataset comprised 100 dyads of nine-year-old children filmed during conflict and cooperative tasks. After automatic diarisation, four synchronised child-centred streams were extracted: facial expressions, body posture, speech acoustics, and transcribed language. Strong unimodal baselines were first established with VideoMAE and FMAEIAT for vision, a CNN-to-LSTM pipeline for audio, and RobBERT for text, with the linguistic channel achieving the best single-modality score (F1 = 0.62). Building on these results, a Pairwise Cross-Modal Attention Network was introduced to learn explicit interactions between modalities. This architecture raised overall performance to F1 = 0.64, outperformed classic fusion techniques, and remained resilient when individual streams were noisy or absent. Ablation analyses showed that body-pose embeddings act as a pivotal hub queried by the other modalities, while filtering out parental speech proved crucial for stable audio contributions. Beyond delivering a systematic benchmark for anxiety detection on the YOUth material, the findings reaffirm the diagnostic value of language, yet demonstrate that carefully designed cross-modal attention can uncover complementary visual and acoustic cues that text alone misses. Although the current performance is not yet clinically sufficient, the research charts a clear path toward scalable, context-aware pre-screening tools for childhood anxiety and lays the groundwork for future extensions, including the use of VLLMs, alternative audio backbones, and dyadic modelling of parental behaviour.
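To make the fusion idea concrete, below is a minimal PyTorch sketch of cross-modal attention over pre-encoded modality embeddings, in the spirit of the Pairwise Cross-Modal Attention Network described above. It is not the thesis implementation: all module names, dimensions, and the use of a single shared attention module (rather than one per modality pair) are simplifying assumptions for illustration.

```python
# A minimal sketch of cross-modal attention fusion, assuming each modality
# (face, pose, audio, text) has already been encoded to a fixed-size embedding.
# All names and shapes here are hypothetical, not the thesis code.
import torch
import torch.nn as nn


class PairwiseCrossModalAttention(nn.Module):
    """Fuses modality embeddings by letting each modality attend to the others."""

    def __init__(self, dim: int = 256, n_heads: int = 4):
        super().__init__()
        # One shared attention module for brevity; a literal pairwise design
        # could instead keep a separate attention block per modality pair.
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.classifier = nn.Linear(dim, 1)  # binary anxiety-related logit

    def forward(self, modalities: dict[str, torch.Tensor]) -> torch.Tensor:
        # modalities: name -> (batch, dim) embedding; at least two required.
        # A missing stream can simply be omitted from the dict, which is one
        # way such a design stays usable when a modality is absent.
        names = list(modalities)
        fused = []
        for q_name in names:
            q = modalities[q_name].unsqueeze(1)            # (batch, 1, dim) query
            kv = torch.stack(
                [modalities[k] for k in names if k != q_name],
                dim=1,
            )                                              # (batch, M-1, dim) keys/values
            out, _ = self.attn(q, kv, kv)                  # q_name attends to the rest
            fused.append(out.squeeze(1))                   # (batch, dim)
        pooled = torch.stack(fused, dim=1).mean(dim=1)     # average the attended views
        return self.classifier(pooled)                     # (batch, 1) logit


# Usage with random stand-ins for the four child-centred streams:
batch, dim = 8, 256
streams = {m: torch.randn(batch, dim) for m in ("face", "pose", "audio", "text")}
logits = PairwiseCrossModalAttention(dim)(streams)
print(logits.shape)  # torch.Size([8, 1])
```

Because each modality queries all the others, dropping a noisy stream from the input dictionary degrades the pooled representation gracefully instead of breaking it, which echoes the resilience to noisy or absent streams reported in the summary.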
URI: https://studenttheses.uu.nl/handle/20.500.12932/50028
        Collections
        • Theses