The appearance of an action differs across viewpoints, so variations in camera viewpoint pose a major problem for video-based action recognition. This paper exploits the complementary information in the RGB and Depth streams of RGB-D videos to address this problem. We capitalize on the spatio-temporal information in the two data streams to extract action features that are largely insensitive to variations in camera viewpoint. We use the RGB data to compute dense trajectories, which are translated into viewpoint-insensitive deep features under a non-linear knowledge transfer model. Similarly, we use the Depth stream to extract convolutional neural network-based view-invariant features, on which a Fourier Temporal Pyramid is computed to incorporate temporal information. The heterogeneous features from the two streams are combined and used as a dictionary to predict the labels of test samples. To that end, we propose a sparse-dense collaborative representation classification scheme that strikes a balance between the discriminative abilities of the dense and sparse representations of the samples over the extracted heterogeneous dictionary. To establish the effectiveness of our approach, we evaluate it on three public datasets and benchmark it against 12 existing techniques. Experiments demonstrate that our approach improves accuracy by up to 7.7% over its nearest competitor.
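
To illustrate the Fourier Temporal Pyramid step mentioned above, the following is a minimal sketch of the general technique and not the implementation used in this work. It recursively splits a sequence of per-frame feature vectors into 1, 2, 4, ... temporal segments, takes the low-frequency Fourier coefficients of each feature dimension within each segment, and concatenates them into one fixed-length descriptor. The function names, the choice of three pyramid levels and four coefficients, the use of coefficient magnitudes, and the plain-Python DFT are all assumptions made for illustration.

```python
import cmath
import math


def dft_magnitudes(signal, n_coeffs):
    """Magnitudes of the first n_coeffs DFT coefficients of a 1-D sequence."""
    n = len(signal)
    mags = []
    for k in range(n_coeffs):
        if k >= n:
            # Zero-pad when the segment has fewer samples than n_coeffs.
            mags.append(0.0)
            continue
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(signal)
        )
        mags.append(abs(coeff))
    return mags


def fourier_temporal_pyramid(frames, levels=3, n_coeffs=4):
    """Hypothetical helper: build a fixed-length video descriptor.

    frames: list of per-frame feature vectors (equal-length lists of floats).
    levels and n_coeffs are illustrative defaults, not values from the paper.
    """
    n_frames = len(frames)
    dim = len(frames[0])
    descriptor = []
    for level in range(levels):
        n_segments = 2 ** level  # 1, 2, 4, ... segments per level
        for s in range(n_segments):
            start = s * n_frames // n_segments
            end = (s + 1) * n_frames // n_segments
            segment = frames[start:end]
            # Treat each feature dimension as a separate time series.
            for d in range(dim):
                series = [frame[d] for frame in segment]
                descriptor.extend(dft_magnitudes(series, n_coeffs))
    return descriptor


# Toy input: 8 frames with 2-D features (a ramp and a constant).
toy_frames = [[float(t), 1.0] for t in range(8)]
toy_descriptor = fourier_temporal_pyramid(toy_frames)
```

The descriptor length depends only on the feature dimension, the number of levels, and the number of retained coefficients, so videos of different durations map to vectors of the same size. Keeping only low-frequency coefficients also discards high-frequency temporal noise.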