Human Interaction Recognition from Video
Summary
Human action recognition has received much attention over the last decade. Inspired by the work of Patron-Perez et al., we construct a pipeline that recognizes 7 person-to-person interactions in video. We focus on urban settings and use videos recorded with consumer hand-held digital cameras. Full bodies are visible, but occlusions do occur over time.
We use the part-based models of Felzenszwalb et al. to detect each person in every frame of the video. The detections are tracked over time using the human tracking framework of Choi et al. Similar to Wang et al., we compute dense trajectories for each track. We estimate the orientation of each person and form pairs of people who are simultaneously on screen. The interaction of each pair is then classified with a multiclass SVM.
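As an illustration, here is a minimal sketch of the pairing and classification stages, assuming the earlier stages (detection, tracking, dense trajectories, orientation estimation) have already produced per-track descriptors. All names, descriptor dimensions, and feature choices are hypothetical, and the SVM is fit on synthetic stand-in data; this is a sketch, not the actual implementation.

```python
# Hypothetical sketch: form co-occurring pairs of person tracks and
# classify each pair's interaction with a multiclass SVM.
from dataclasses import dataclass
from itertools import combinations

import numpy as np
from sklearn.svm import SVC


@dataclass
class Track:
    person_id: int
    frames: set                # frame indices in which the person is visible
    positions: dict            # frame index -> (x, y) image position
    orientation: np.ndarray    # 8-bin body-orientation histogram (assumed)
    trajectory: np.ndarray     # aggregated dense-trajectory descriptor (assumed)


def co_occurring_pairs(tracks, min_overlap=10):
    """Yield pairs of tracks that are simultaneously on screen."""
    for a, b in combinations(tracks, 2):
        shared = a.frames & b.frames
        if len(shared) >= min_overlap:
            yield a, b, shared


def pair_feature(a, b, shared):
    """Concatenate pairwise cues: orientations, trajectories, mean distance."""
    dists = [np.linalg.norm(np.subtract(a.positions[f], b.positions[f]))
             for f in shared]
    return np.concatenate([a.orientation, b.orientation,
                           a.trajectory, b.trajectory,
                           [np.mean(dists)]])


# scikit-learn's SVC handles the 7 interaction classes internally via
# one-vs-one voting; trained here on random placeholder features.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(70, 33))      # 70 synthetic pair features
y_train = rng.integers(0, 7, size=70)    # 7 interaction class labels
clf = SVC(kernel="rbf").fit(X_train, y_train)

# Two synthetic tracks whose visible frames overlap, classified as a pair.
t1 = Track(1, set(range(30)), {f: (f, 0.0) for f in range(30)},
           rng.random(8), rng.random(8))
t2 = Track(2, set(range(5, 40)), {f: (f, 50.0) for f in range(5, 40)},
           rng.random(8), rng.random(8))
for a, b, shared in co_occurring_pairs([t1, t2]):
    print(clf.predict([pair_feature(a, b, shared)]))
```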
We examine the performance of each stage of the pipeline to understand its relative strengths and weaknesses. The human tracking approach performs well on the ETH data set, but the results on the Collective Activity data set leave room for improvement. When classifying the 8 orientation classes of the Collective Activity data set, we obtain an accuracy of 34.97%; many of the misclassifications fall into closely related (adjacent) orientation classes. For the final interaction recognition we obtain an overall accuracy of 53.57%. Orientation estimation has a strong influence on the overall performance, whereas the influence of the dense trajectories and of the distance between the targets is limited.
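The adjacency effect can be made concrete with a small sketch that treats the 8 orientation bins (45 degrees apart) as circular and also scores predictions that land within one bin of the ground truth. The labels below are synthetic placeholders, not our experimental outputs.

```python
# Exact vs. within-one-bin accuracy for circular 8-bin orientation labels.
import numpy as np

def orientation_accuracy(true_bins, pred_bins, n_bins=8, tolerance=0):
    """Fraction of predictions within `tolerance` bins of the truth,
    treating the bins as circular (bin 0 and bin 7 are neighbours)."""
    diff = np.abs(np.asarray(true_bins) - np.asarray(pred_bins)) % n_bins
    circ = np.minimum(diff, n_bins - diff)
    return float(np.mean(circ <= tolerance))

rng = np.random.default_rng(1)
true_bins = rng.integers(0, 8, size=200)
pred_bins = (true_bins + rng.integers(-1, 2, size=200)) % 8  # off by at most one bin
print(orientation_accuracy(true_bins, pred_bins))                # exact accuracy
print(orientation_accuracy(true_bins, pred_bins, tolerance=1))   # within one bin
```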