Viewpoint variation is a major challenge in video-based human action recognition. We exploit the simultaneous RGB and depth sensing of RGB-D cameras to address this problem. Our technique capitalizes on the complementary spatio-temporal information in the RGB and depth frames of RGB-D videos to achieve viewpoint-invariant action recognition. We extract view-invariant features from the dense trajectories of the RGB stream using a non-linear knowledge transfer model. Simultaneously, view-invariant human pose features are extracted from the depth stream using a CNN model, and a Fourier Temporal Pyramid is computed over them. The resulting heterogeneous features are carefully combined and used to train an L1L2 classifier. To establish the effectiveness of the proposed approach, we benchmark our technique on two standard datasets and compare its performance with twelve existing methods. Our approach achieves up to 7.2% improvement in accuracy over the nearest competitor.
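As a minimal sketch of the Fourier Temporal Pyramid step mentioned above: a standard FTP recursively splits a temporal feature sequence into segments and keeps the magnitudes of a few low-frequency Fourier coefficients per segment. The function name, the number of levels, and the number of retained coefficients below are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def fourier_temporal_pyramid(seq, levels=3, n_coeffs=4):
    """Compute a Fourier Temporal Pyramid over a (T, D) feature sequence.

    At pyramid level l the sequence is split into 2**l temporal segments;
    for each segment we keep the magnitudes of the first `n_coeffs`
    low-frequency FFT coefficients per feature dimension, then
    concatenate everything into one fixed-length descriptor.
    (levels=3 and n_coeffs=4 are assumed defaults for illustration.)
    """
    seq = np.asarray(seq, dtype=float)
    feats = []
    for level in range(levels):
        for segment in np.array_split(seq, 2 ** level, axis=0):
            spectrum = np.abs(np.fft.fft(segment, axis=0))
            coeffs = spectrum[:n_coeffs]
            # Pad with zeros if a segment has fewer than n_coeffs frames
            if coeffs.shape[0] < n_coeffs:
                pad = n_coeffs - coeffs.shape[0]
                coeffs = np.pad(coeffs, ((0, pad), (0, 0)))
            feats.append(coeffs.ravel())
    return np.concatenate(feats)
```

With `levels=3` the descriptor has a fixed length of (1 + 2 + 4) * n_coeffs * D regardless of the video length T, which is what makes the representation usable by a fixed-input classifier.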
Publication status: Published - 2018