Pose Estimation in Video
Summary
Human pose estimation in video has numerous applications, such as human activity analysis, automatic surveillance, human-computer interaction and markerless motion capture. It is challenging because of the kinematic structure of the human body, the wide variety of possible poses, the large variation in appearance caused by clothing, and background clutter that can resemble human body parts and confuse the system.
Current methods in human pose estimation either focus on specific situations, such as pedestrians or laboratory-controlled motions, or sacrifice accuracy in order to cope with videos containing any type of human activity. This thesis presents an improved system built upon the method of [Ramanan et al., 2007], which models a person's body configuration as a puppet of rectangles. The system first analyses all the frames of a video to find a specific pose, from which it learns the appearance of the person to be tracked. It then processes the video to detect the person in any possible pose.
We analysed the robustness of the original method by comparing its pose estimates with labelled ground truth. We also challenged the authors' claim that a single set of parameters can fit multiple videos; this remains an open issue. We then extended the original method by including temporal information through two different types of motion models, which improved the tracking results. According to our qualitative evaluation of side-by-side tracking sequences, the extensions yield more stable and accurate detections over time and resolve some challenging situations that arise when the motion is fast or when body parts resemble each other. We found that the system performs poorly when detecting arms, due to their size; this remains the main problem to be solved in future work.
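As an illustration of how temporal information can stabilise per-frame detections, the following minimal sketch blends each detection of a single body part with a constant-velocity prediction. This summary does not specify which two motion models were used, so the constant-velocity assumption, the function names and the `weight` parameter are hypothetical and are not taken from [Ramanan et al., 2007] or from the thesis itself.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) image position of one body part


def predict_constant_velocity(prev: Point, curr: Point) -> Point:
    """Extrapolate the next position, assuming the displacement
    between the last two frames repeats (hypothetical motion model)."""
    return (2 * curr[0] - prev[0], 2 * curr[1] - prev[1])


def smooth_track(detections: List[Point], weight: float = 0.5) -> List[Point]:
    """Blend each per-frame detection with the motion prediction.

    `weight` is an illustrative parameter: the trust placed in the
    detection (1.0 ignores the motion model entirely).
    """
    if len(detections) < 3:
        return list(detections)
    # The first two frames are kept as detected; they seed the velocity.
    track = [detections[0], detections[1]]
    for det in detections[2:]:
        pred = predict_constant_velocity(track[-2], track[-1])
        track.append((
            weight * det[0] + (1 - weight) * pred[0],
            weight * det[1] + (1 - weight) * pred[1],
        ))
    return track
```

In this sketch, a detection that jumps far from the predicted position is pulled back towards the trajectory, which is one simple way in which a motion model can reduce jitter between frames. A practical system would apply such a prior to every rectangle of the puppet and would typically weight it by detection confidence.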