Enhancing Human Contact Signature Estimation Through Multi-View Integration in Complex Parent–Infant Free-Play Interactions
Summary
Understanding early parent–infant interactions, particularly those involving physical touch, is vital for assessing developmental progress and emotional bonding. Recent AI-based systems have shown promising results in automating physical contact
detection; however, most rely on single-view input and are highly susceptible to occlusion, which limits their effectiveness in real-world settings. This thesis aimed to address that gap by extending the Image2Contact framework to incorporate multi-view
input for improved contact prediction during free-play interactions. Several fusion approaches were implemented, including early feature concatenation, statistical decision fusion, logit-level fusion with fully connected heads, and attention-based feature fusion (illustrated schematically in the first sketch below). All models were trained and evaluated on the YOUth PCI dataset, which contains real-world, multi-camera recordings of parent–infant interactions. The results
showed that multi-view models consistently outperformed single-view baselines, though
the improvements were often modest. To understand this limited performance gain,
the analysis examined the influence of pose confidence, contact density, occlusion, and
body-region specificity on model behaviour. It revealed that prediction success was
strongly driven by physical contact density and the frequency of body-part involvement. Dense-contact scenes yielded better model performance, while sparse-contact
frames remained challenging. One important finding was that the models' tendency to over-predict stemmed from a mismatch between the evaluation metric and the loss function used during training: to avoid the heavy penalty for missing true contacts (false negatives), the models learned to overestimate contact. Although both single-view and
multi-view models were affected, multi-view architectures better handled low-contact
frames by leveraging cross-view spatial cues to make more precise predictions. Region-aware loss functions and threshold calibration (see the second sketch below) improved single-view performance, in some cases matching multi-view results. However, multi-view models saw limited benefit from these adjustments. These findings shed light on the factors that most influence performance and advance our understanding of the limitations and potential of
single-view and multi-view systems in contact signature prediction for parent–infant interactions. Furthermore, this study suggests that future systems should prioritize data quality, employ region-aware, class-balanced loss functions aligned with the evaluation metric, and incorporate more sophisticated architectural designs to achieve more precise predictions.
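
To make the fusion variants named above concrete, the following is a minimal, hypothetical sketch, not the thesis implementation, of how per-view features from a shared backbone could be combined by early concatenation, logit-level fusion with per-view fully connected heads, or attention-based feature fusion. The module name, dimensions, and the mean-based combination of per-view logits are assumptions made purely for illustration.

```python
# Illustrative sketch only: three generic ways to fuse per-view features for
# multi-label contact prediction, assuming each camera view has already been
# encoded into a fixed-length feature vector by a shared backbone.
import torch
import torch.nn as nn


class MultiViewFusion(nn.Module):
    def __init__(self, feat_dim: int, num_views: int, num_regions: int, mode: str = "concat"):
        super().__init__()
        self.mode = mode
        if mode == "concat":            # early feature concatenation
            self.head = nn.Linear(feat_dim * num_views, num_regions)
        elif mode == "logit":           # per-view heads, fused at the logit level
            self.heads = nn.ModuleList(
                [nn.Linear(feat_dim, num_regions) for _ in range(num_views)]
            )
        elif mode == "attention":       # attention-weighted feature fusion
            # feat_dim must be divisible by num_heads
            self.attn = nn.MultiheadAttention(feat_dim, num_heads=4, batch_first=True)
            self.head = nn.Linear(feat_dim, num_regions)
        else:
            raise ValueError(f"unknown fusion mode: {mode}")

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, num_views, feat_dim)
        if self.mode == "concat":
            return self.head(feats.flatten(1))
        if self.mode == "logit":
            logits = torch.stack(
                [head(feats[:, i]) for i, head in enumerate(self.heads)], dim=1
            )
            return logits.mean(dim=1)   # simple decision-level averaging
        attended, _ = self.attn(feats, feats, feats)  # views attend to each other
        return self.head(attended.mean(dim=1))
```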
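Similarly, the threshold calibration mentioned above could, for example, be performed per body region on a held-out validation split. The second sketch below is again a hypothetical illustration: the per-region granularity, the candidate grid, and the use of IoU as the tuning score are assumptions, not details taken from the thesis.

```python
# Illustrative sketch only: per-region threshold calibration on validation
# data, choosing the probability cutoff that maximizes an IoU-style score
# instead of relying on a fixed 0.5 threshold.
import numpy as np


def calibrate_thresholds(val_probs: np.ndarray, val_labels: np.ndarray,
                         candidates=np.linspace(0.05, 0.95, 19)) -> np.ndarray:
    """val_probs, val_labels: (num_frames, num_regions) arrays."""
    num_regions = val_probs.shape[1]
    thresholds = np.full(num_regions, 0.5)
    for r in range(num_regions):
        best_score = -1.0
        for t in candidates:
            pred = val_probs[:, r] >= t
            true = val_labels[:, r] > 0
            inter = np.logical_and(pred, true).sum()
            union = np.logical_or(pred, true).sum()
            score = inter / union if union > 0 else 1.0
            if score > best_score:
                best_score, thresholds[r] = score, t
        # thresholds[r] now counteracts the over-prediction encouraged by a
        # training loss that is not aligned with the evaluation metric
    return thresholds
```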