Synthesising 2D images of adult-child interaction for human pose estimation
Summary
In recent years, human pose estimators have become considerably better at predicting the pose of people, especially adults. They still struggle with occlusions, however, and their performance on children has improved far less, due to a lack of child-specific pose data and to children's bodies having different proportions than adults'. Adult-child interaction is even more difficult, as it involves heavy occlusion and people with vastly different body sizes. This is unfortunate, because accurate pose estimation of such interactions could be valuable in many areas, e.g., human-computer interaction, healthcare, and the behavioural sciences.
In this research, I try to improve pose estimators' performance on adult-child interactions by synthesising data of such interactions. Other studies have addressed the lack of pose data by synthesising it from motion capture data. I take a different approach: synthesising data by adjusting the poses of 3D human models in Unity. The adult and child models are posed such that they interact. In total I created 40 different interaction scenes, from which I rendered 40,571 2D images. During synthesis I varied the scenes' aesthetics to obtain a diverse set of images. Unity automatically added precise annotations, among others the 2D and 3D keypoint locations, which allowed me to use these images to finetune four state-of-the-art human pose estimators (HigherHRNet-W32, HigherHRNet-W48, HRNet-W48 and Stacked Hourglass).
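The 2D keypoint annotations follow directly from the known 3D joint positions and the virtual camera: each joint is projected onto the image plane at render time. The sketch below illustrates this with a generic pinhole-camera projection in Python; the function and parameter names (project_joint, fx, fy, cx, cy) are illustrative only and are not part of the actual Unity pipeline, which performs this projection internally.

```python
import numpy as np

def project_joint(joint_cam, fx, fy, cx, cy):
    """Project a 3D joint (in camera coordinates, metres) to 2D pixel coordinates.

    A minimal pinhole-camera sketch of what a renderer such as Unity does
    internally when exporting 2D keypoint annotations alongside an image.
    """
    x, y, z = joint_cam
    u = fx * x / z + cx   # horizontal pixel coordinate
    v = fy * y / z + cy   # vertical pixel coordinate
    return np.array([u, v])

# Example: a joint 2 m in front of a camera with a 1000 px focal length
# and the principal point at the centre of a 1920x1080 image.
print(project_joint((0.3, -0.1, 2.0), fx=1000, fy=1000, cx=960, cy=540))
```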
To finetune the models, I combined a subset of the synthesised images (27,042 images) with the COCO training data and trained on the combination. I evaluated the models on a Youth images test set, for which I annotated 520 challenging images of adult-child interaction. These images are challenging due to occlusions, self-occlusions, people blending in with the background, and keypoints falling outside the camera bounds. After finetuning, the models' AP improved by 1.81, 0.72, 1.14 and 1.55 points respectively, while their AR improved by 2.52, 2.17, 1.35 and 1.25 points; for most models the AR gains are larger than the AP gains. These improvements show that motion capture data is not necessary for synthesising images that improve pose estimators' performance on adult-child interaction data.
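The reported AP and AR follow the standard COCO keypoint protocol (OKS-based average precision and recall). Below is a hedged sketch of how such numbers can be computed with pycocotools; the file names youth_test_gt.json and predictions.json are placeholders and not the actual files used in this work.

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Ground-truth annotations of the test set (COCO keypoint format) and the
# model's predicted keypoints. File names here are placeholders.
coco_gt = COCO("youth_test_gt.json")
coco_dt = coco_gt.loadRes("predictions.json")

# OKS-based keypoint evaluation, as in the standard COCO protocol.
evaluator = COCOeval(coco_gt, coco_dt, iouType="keypoints")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()  # prints AP, AP@.5, AP@.75, AR, etc.
```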