Leveraging multimodal deep learning to predict anxiety symptoms from parent–child interactions
Summary
Early, unobtrusive detection of anxiety in children can facilitate timelier
and more equitable access to mental-health interventions. This thesis developed
and evaluated a deep-learning pipeline that predicts anxiety-related behaviours
from recordings of parent–child interactions collected in the YOUth
Cohort Study. The dataset comprised 100 dyads of nine-year-old children
filmed in conflict and cooperative tasks. After automatic diarisation, four
synchronised child-centred streams were extracted: facial expressions, body
posture, speech acoustics, and transcribed language.
Strong unimodal baselines were first established with VideoMAE and FMAEIAT
for vision, a CNN-to-LSTM pipeline for audio, and RobBERT for text,
with the linguistic channel achieving the best single-modality score (F1 =
0.62). Building on these results, a Pairwise Cross-Modal Attention Network
was introduced to learn explicit interactions between modalities. This architecture
raised overall performance to F1 = 0.64, outperformed classic fusion
techniques, and remained resilient when individual streams were noisy or
absent. Ablation analyses showed that body-pose embeddings acted as
a pivotal hub queried by the other modalities, and that filtering out parental
speech was crucial for stable audio contributions.
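
To make the pairwise design concrete, the sketch below shows one way such an architecture can be wired up in PyTorch: every modality attends to every other modality, and the attended representations are pooled and concatenated before classification. The module names, embedding dimensions, pooling choice, and class count are illustrative assumptions, not the thesis implementation.

```python
# Minimal sketch of pairwise cross-modal attention (hypothetical names and
# dimensions): each modality queries every other modality with standard
# multi-head attention; attended views are mean-pooled and concatenated.
import torch
import torch.nn as nn

class PairwiseCrossModalAttention(nn.Module):
    def __init__(self, dim=256, heads=4, num_modalities=4, num_classes=2):
        super().__init__()
        self.num_modalities = num_modalities
        # One attention block per ordered (query modality, key/value modality) pair.
        self.attn = nn.ModuleDict({
            f"{q}->{k}": nn.MultiheadAttention(dim, heads, batch_first=True)
            for q in range(num_modalities)
            for k in range(num_modalities) if q != k
        })
        fused_dim = dim * num_modalities * (num_modalities - 1)
        self.classifier = nn.Linear(fused_dim, num_classes)

    def forward(self, streams):
        # streams: list of per-modality token sequences, each (batch, seq_len, dim)
        pooled = []
        for q, query in enumerate(streams):
            for k, context in enumerate(streams):
                if q == k:
                    continue
                attended, _ = self.attn[f"{q}->{k}"](query, context, context)
                pooled.append(attended.mean(dim=1))  # mean-pool over time
        return self.classifier(torch.cat(pooled, dim=-1))

# Toy usage with four synchronised streams (e.g. face, pose, audio, text embeddings).
model = PairwiseCrossModalAttention()
streams = [torch.randn(2, 16, 256) for _ in range(4)]
logits = model(streams)  # shape: (2, num_classes)
```

Because each modality serves as both query and context in this layout, the network can weight, for example, body-pose tokens more heavily when the linguistic stream is uninformative, which is consistent with the hub role observed for pose in the ablations.
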
Beyond delivering a systematic benchmark for anxiety detection on the YOUth
material, the findings reaffirm the diagnostic value of language yet demonstrate
that carefully designed cross-modal attention can uncover complementary
visual and acoustic cues that text alone misses. Although the current
performance is not yet clinically sufficient, the research charts a clear path
toward scalable, context-aware pre-screening tools for childhood anxiety and
lays the groundwork for future extensions, including the use of VLLMs, alternative
audio backbones, and dyadic modelling of parental behaviour.