
dc.rights.license	CC-BY-NC-ND
dc.contributor.advisor	Önal Ertugrul, I.
dc.contributor.author	Zenios, Sotiris
dc.date.accessioned	2025-08-28T00:01:42Z
dc.date.available	2025-08-28T00:01:42Z
dc.date.issued	2025
dc.identifier.uri	https://studenttheses.uu.nl/handle/20.500.12932/50028
dc.description.abstract	Early, unobtrusive detection of anxiety in children can facilitate timelier and more equitable access to mental-health interventions. This thesis developed and evaluated a deep-learning pipeline that predicts anxiety-related behaviours from recordings of parent–child interactions collected in the YOUth Cohort Study. The dataset comprised 100 dyads of nine-year-old children filmed in conflict and cooperative tasks. After automatic diarisation, four synchronised child-centred streams were extracted: facial expressions, body posture, speech acoustics, and transcribed language. Strong unimodal baselines were first established with VideoMAE and FMAE-IAT for vision, a CNN-LSTM pipeline for audio, and RobBERT for text, with the linguistic channel achieving the best single-modality score (F1 = 0.62). Building on these results, a Pairwise Cross-Modal Attention Network was introduced to learn explicit interactions between modalities. This architecture raised overall performance to F1 = 0.64, outperformed classic fusion techniques, and remained resilient when individual streams were noisy or absent. Ablation analyses showed that body-pose embeddings operate as a pivotal hub queried by the other modalities, while filtering out parental speech proved crucial for stable audio contributions. Beyond delivering a systematic benchmark for anxiety detection on the YOUth material, the findings reaffirm the diagnostic value of language yet demonstrate that carefully designed cross-modal attention can uncover complementary visual and acoustic cues that text alone misses. Although the current performance is not yet clinically sufficient, the research charts a clear path toward scalable, context-aware pre-screening tools for childhood anxiety and lays the groundwork for future extensions, including the use of VLLMs, alternative audio backbones, and dyadic modelling of parental behaviour.
dc.description.sponsorship	Utrecht University
dc.language.iso	EN
dc.subject	Leveraging multi modal deep-learning to predict anxiety symptoms from parent-child interactions
dc.title	Leveraging multi modal deep-learning to predict anxiety symptoms from parent-child interactions
dc.type.content	Master Thesis
dc.rights.accessrights	Open Access
dc.subject.courseuu	Artificial Intelligence
dc.thesis.id	52829
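
The abstract describes a Pairwise Cross-Modal Attention Network that fuses facial, postural, acoustic and linguistic embeddings. The thesis code is not part of this record, so the following is only a minimal, hypothetical PyTorch sketch of the general idea: every modality queries every other modality through multi-head cross-attention, and the pooled pairwise outputs feed a binary classifier. The modality names, embedding dimension (256), head count and classification head are illustrative assumptions, not details taken from the thesis.

    # Hypothetical sketch of pairwise cross-modal attention fusion (not the thesis code).
    # Assumes each modality has already been encoded into a (batch, seq_len, dim) tensor,
    # e.g. by VideoMAE / FMAE-IAT for vision, a CNN-LSTM for audio, and RobBERT for text.
    import torch
    import torch.nn as nn

    class PairwiseCrossModalFusion(nn.Module):
        def __init__(self, dim=256, n_heads=4, modalities=("face", "pose", "audio", "text")):
            super().__init__()
            self.modalities = modalities
            # One cross-attention block per ordered (query, key/value) modality pair.
            self.cross_attn = nn.ModuleDict({
                f"{q}_from_{k}": nn.MultiheadAttention(dim, n_heads, batch_first=True)
                for q in modalities for k in modalities if q != k
            })
            # Illustrative binary head: anxiety-related behaviour vs. not.
            n_pairs = len(modalities) * (len(modalities) - 1)
            self.classifier = nn.Sequential(
                nn.Linear(n_pairs * dim, dim), nn.ReLU(), nn.Linear(dim, 1)
            )

        def forward(self, feats):
            # feats[m]: (batch, seq_len_m, dim) embeddings from each unimodal encoder.
            pooled = []
            for q in self.modalities:
                for k in self.modalities:
                    if q == k:
                        continue
                    out, _ = self.cross_attn[f"{q}_from_{k}"](feats[q], feats[k], feats[k])
                    pooled.append(out.mean(dim=1))  # average over the query sequence
            return self.classifier(torch.cat(pooled, dim=-1)).squeeze(-1)

    # Toy usage with random features standing in for the real encoders.
    if __name__ == "__main__":
        feats = {m: torch.randn(2, 10, 256) for m in ("face", "pose", "audio", "text")}
        logits = PairwiseCrossModalFusion()(feats)
        print(logits.shape)  # torch.Size([2])

In a design of this kind, the attention weights of the pose-queried blocks would be one place to inspect the "body-pose hub" effect mentioned in the abstract, and dropping a modality's entry from the input dictionary is a simple way to probe robustness to missing streams.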

