One highlight of deep networks is their ability to learn useful representations automatically from raw inputs, reducing the need for domain knowledge, generic priors and hand-crafted feature extraction. Training deep networks nevertheless remains challenging, because extensive human expertise is required to choose the type of model and its hyper-parameters so that the system trains properly. In this work, we propose a novel feature learning framework based on the marginalised stacked auto-encoder that does not require practitioners to have any deep-learning-specific knowledge. We apply this method to visual speech recognition, where it outperforms other feature extraction methods by 2% in accuracy on speaker-independent systems. Because the method is task-agnostic, we also evaluate it on MNIST, a popular handwritten digit recognition database; it achieves an error rate of 1.30%, comparable to the best models tuned by experts.
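As context for the building block named above, the following is a minimal sketch of one layer of a marginalised stacked denoising auto-encoder in the style of Chen et al. (2012), whose weights have a closed-form solution obtained by marginalising out the corruption noise rather than training with backpropagation. The function names (`mda_layer`, `msda`), the corruption probability `p`, and the small ridge term are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def mda_layer(X, p, lam=1e-5):
    """One marginalised denoising auto-encoder layer (closed form).

    X   : (d, n) data matrix, one example per column
    p   : probability that a feature is corrupted (set to zero)
    lam : small ridge term for numerical stability (assumed, not from the paper)
    Returns the linear mapping W and the nonlinear hidden representation.
    """
    d, n = X.shape
    Xb = np.vstack([X, np.ones((1, n))])               # append a bias row
    q = np.concatenate([np.full(d, 1.0 - p), [1.0]])   # feature survival probs; bias never corrupted
    S = Xb @ Xb.T                                      # scatter matrix
    Q = S * np.outer(q, q)                             # E[corrupted scatter], off-diagonal terms
    np.fill_diagonal(Q, q * np.diag(S))                # diagonal needs only one survival factor
    P = S[:d, :] * q[np.newaxis, :]                    # E[cross-scatter with clean data]
    # Closed-form weights: W = P (Q + lam I)^{-1}; Q is symmetric, so solve directly.
    W = np.linalg.solve(Q + lam * np.eye(d + 1), P.T).T
    return W, np.tanh(W @ Xb)

def msda(X, p, n_layers):
    """Stack layers; concatenate the input and all hidden representations."""
    feats, h = [X], X
    for _ in range(n_layers):
        _, h = mda_layer(h, p)
        feats.append(h)
    return np.vstack(feats)
```

Because each layer is solved in closed form, the only choices left to the practitioner are the corruption probability and the number of layers, which is consistent with the abstract's claim of avoiding deep-learning-specific tuning.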