Enhancing Human Contact Signature Estimation Through Multi-View Integration in Complex Parent–Infant Free-Play Interactions
Summary
Understanding early parent–infant interactions, particularly those involving physical touch, is vital for assessing developmental progress and emotional bonding. Recent AI-based systems have shown promising results in automating physical contact
detection; however, most rely on single-view input and are highly susceptible to occlusion, which limits their effectiveness in real-world settings. This thesis aimed to address that gap by extending the Image2Contact framework to incorporate multi-view
input for improved contact prediction during free-play interactions. Several fusion approaches were implemented, including early feature concatenation, statistical decision fusion, logit-level fusion with fully connected heads, and attention-based feature fusion (illustrated schematically in the first sketch below). All models were trained and evaluated on the YOUth PCI dataset, which contains real-world, multi-camera recordings of parent–infant interactions. The results
showed that multi-view models consistently outperformed single-view baselines, though
the improvements were often modest. To understand this limited performance gain,
the analysis examined the influence of pose confidence, contact density, occlusion, and
body-region specificity on model behaviour. It revealed that prediction success was
strongly driven by physical contact density and the frequency of body-part involvement. Dense-contact scenes yielded better model performance, while sparse-contact
frames remained challenging. One important finding was that the models' tendency to over-predict stemmed from a mismatch between the evaluation metric and the loss function used during training: to avoid the heavy penalty for missing true contacts (false negatives), the models learned to overestimate contact. Although both single-view and
multi-view models were affected, multi-view architectures better handled low-contact
frames by leveraging cross-view spatial cues to make more precise predictions. Region-aware loss functions and threshold calibration (see the second sketch below) improved single-view performance, in some cases matching multi-view results. However, multi-view models saw limited benefit from these adjustments. These findings shed light on the factors that most influence performance and advance our understanding of the limitations and potential of
single-view and multi-view systems in contact signature prediction for parent–infant interactions. Furthermore, this study suggests that future systems should prioritize data quality, employ region-aware, class-balanced loss functions aligned with the evaluation metric, and incorporate more sophisticated architectural designs to achieve more precise predictions.
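
To make the fusion variants named above concrete, the following is a minimal, hypothetical sketch, not the thesis implementation, of how per-view features from a shared backbone could be combined by early concatenation, logit-level fusion with per-view fully connected heads, or attention-based feature fusion. The module name, dimensions, and the mean-based combination of per-view logits are assumptions made purely for illustration.

```python
# Illustrative sketch only: three generic ways to fuse per-view features for
# multi-label contact prediction, assuming each camera view has already been
# encoded into a fixed-length feature vector by a shared backbone.
import torch
import torch.nn as nn


class MultiViewFusion(nn.Module):
    def __init__(self, feat_dim: int, num_views: int, num_regions: int, mode: str = "concat"):
        super().__init__()
        self.mode = mode
        if mode == "concat":            # early feature concatenation
            self.head = nn.Linear(feat_dim * num_views, num_regions)
        elif mode == "logit":           # per-view heads, fused at the logit level
            self.heads = nn.ModuleList(
                [nn.Linear(feat_dim, num_regions) for _ in range(num_views)]
            )
        elif mode == "attention":       # attention-weighted feature fusion
            # feat_dim must be divisible by num_heads
            self.attn = nn.MultiheadAttention(feat_dim, num_heads=4, batch_first=True)
            self.head = nn.Linear(feat_dim, num_regions)
        else:
            raise ValueError(f"unknown fusion mode: {mode}")

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, num_views, feat_dim)
        if self.mode == "concat":
            return self.head(feats.flatten(1))
        if self.mode == "logit":
            logits = torch.stack(
                [head(feats[:, i]) for i, head in enumerate(self.heads)], dim=1
            )
            return logits.mean(dim=1)   # simple decision-level averaging
        attended, _ = self.attn(feats, feats, feats)  # views attend to each other
        return self.head(attended.mean(dim=1))
```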
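Similarly, the threshold calibration mentioned above could, for example, be performed per body region on a held-out validation split. The second sketch below is again a hypothetical illustration: the per-region granularity, the candidate grid, and the use of IoU as the tuning score are assumptions, not details taken from the thesis.

```python
# Illustrative sketch only: per-region threshold calibration on validation
# data, choosing the probability cutoff that maximizes an IoU-style score
# instead of relying on a fixed 0.5 threshold.
import numpy as np


def calibrate_thresholds(val_probs: np.ndarray, val_labels: np.ndarray,
                         candidates=np.linspace(0.05, 0.95, 19)) -> np.ndarray:
    """val_probs, val_labels: (num_frames, num_regions) arrays."""
    num_regions = val_probs.shape[1]
    thresholds = np.full(num_regions, 0.5)
    for r in range(num_regions):
        best_score = -1.0
        for t in candidates:
            pred = val_probs[:, r] >= t
            true = val_labels[:, r] > 0
            inter = np.logical_and(pred, true).sum()
            union = np.logical_or(pred, true).sum()
            score = inter / union if union > 0 else 1.0
            if score > best_score:
                best_score, thresholds[r] = score, t
        # thresholds[r] now counteracts the over-prediction encouraged by a
        # training loss that is not aligned with the evaluation metric
    return thresholds
```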