The success of blind source separation systems relies on their ability to handle adverse real-world conditions, and audio applications of blind source separation often face the cocktail party problem of multiple simultaneously active speakers. The convolutive nature of the source mixing and the number of microphones relative to the number of source signals add to the difficulty of the problem. This thesis focuses on the sparseness-based time-frequency masking approach to under-determined blind source separation. In particular, we direct our attention to the automatic estimation of time-frequency masks based on the full-band clustering of spatial features extracted from the microphone observations.
We explore recent advances in full-band clustering-based techniques for blind source separation, and investigate and extend an existing technique termed the multiple sensors degenerate unmixing estimation technique (MENUET). We modify MENUET by using an alternative clustering scheme for mask estimation, fuzzy c-means, and present comprehensive evaluations in a range of environments to establish its feasibility. We then explore two extensions to fuzzy c-means: first, the estimation of reliability weights for the spatial features, and second, the inclusion of contextual information in the clustering objective function.
This thesis also investigates other full-band clustering techniques for mask estimation, such as the Gaussian mixture model and the Watson mixture model. We evaluate and compare the performance of all the full-band clustering techniques in this thesis, and draw conclusions as to which are the most robust across a variety of simulated and real-world conditions, including international benchmark data sets. Finally, to remove any requirement for a priori knowledge of the number of source signals, we consider two novel approaches to source number estimation. The first uses an adaptive optimization scheme based on the fuzzy c-means clustering algorithm, whilst the second considers the full-band clustering of speech activity sequences with the Watson mixture model.
|Qualification|Doctor of Philosophy|
|Publication status|Unpublished - 2014|