A Cascade Gray-Stereo Visual Feature Extraction Method For Visual and Audio-Visual Speech Recognition

    Research output: Contribution to journal › Article

    6 Citations (Scopus)
    231 Downloads (Pure)

    Abstract

    Although stereo information has been used extensively in computer vision tasks in recent years, the incorporation of stereo visual information into Audio-Visual Speech Recognition (AVSR) systems, and whether it can improve recognition accuracy, remains largely unexplored. This paper addresses three fundamental issues in this area: 1) Do stereo features benefit visual and audio-visual speech recognition? 2) If so, how much information is embedded in stereo features? 3) How can both planar and stereo information be encoded in a compact feature vector? We present a comprehensive study of the characteristics of both planar and stereo visual features, and extensively analyse why stereo information can boost visual speech recognition. Based on the different information embedded in planar and stereo features, we present a new Cascade Hybrid Appearance Visual Feature (CHAVF) extraction scheme that combines planar and stereo visual information into a compact feature vector, and we evaluate this novel feature on visual and audio-visual connected digit recognition and isolated phrase recognition. The results show that stereo information can significantly improve speech recognition performance, and that our proposed visual feature outperforms other commonly used appearance-based visual features on both the visual and audio-visual speech recognition tasks. In particular, our proposed planar-stereo visual feature yields approximately 21% relative improvement over the planar visual feature. To the best of our knowledge, this is the first paper to extensively evaluate the different characteristics of planar and stereo visual features, and the first to show that using the stereo feature along with the planar feature can significantly boost accuracy on a large-scale audio-visual data corpus.
    Original language: English
    Pages (from-to): 26-38
    Number of pages: 13
    Journal: Speech Communication
    Volume: 90
    DOIs
    Publication status: Published - Jun 2017
