Studies on microphone array processing and time-frequency masking for robust automatic speech recognition

Marco Kuhne

    Research output: Thesis › Doctoral Thesis


    Abstract

    In order to deploy automatic speech recognition technology in real-world scenarios, it is necessary to handle cocktail-party-like environments with multiple speech and noise sources. Despite several decades of research, the noise robustness of state-of-the-art speech recognizers still falls short of human capabilities. This dissertation focuses on the development of computational models for automatic speech separation and recognition in multi-talker environments. Several aspects of cocktail-party processing are studied, ranging from source localization through source separation to speech recognition. The approach considered aims for a closer integration of microphone array processing and missing feature techniques for noise-robust speech recognition. Front-end and back-end processing are linked together through the consistent application of time-frequency masking for both source separation and speech recognition. The use of cluster algorithms for automatically estimating these masks on the basis of spatial localization cues is investigated for anechoic and reverberant data. The incorporation of spatial observation weights and time-frequency context information is proposed as a means to increase the localization and segmentation performance of standard fuzzy cluster algorithms, particularly in echoic conditions. While the former helps to improve the localization accuracy by ignoring noisy observations, the latter smooths the fuzzy cluster membership levels by exploiting the high correlation of neighboring mask points. The resulting robust fuzzy cluster algorithm is integrated into a source separation system which combines the advantages of time-frequency masking with the separation capabilities of adaptive beamforming. The thesis also investigates a novel evidence model in the form of the bounded-Gauss-Uniform mixture probability density function for missing data speech recognition. In a number of simulated cocktail-party scenarios it is observed that th
    Original language: English
    Qualification: Doctor of Philosophy
    Publication status: Unpublished - 2009
