TY - BOOK
T1 - Characterising the correlation structure of high dimensional genomic datasets using a random matrix theory approach
AU - Feher, Kristen
PY - 2010
Y1 - 2010
N2 - The aim of genomic data analysis is to infer specific relationships amongst constituents of a complex system. The applied statistical methodology developed for this purpose relies heavily on the sample gene covariance matrix, which requires regularisation when the number of genes is larger than the sample size. However, the constituents' individual behaviours coalesce into a single macroscopic system. Random Matrix Theory (RMT) is well suited to describing complex systems when constituent interactions become too complicated to model, and is becoming increasingly important in covariance estimation. Despite the great insight RMT can bring to high dimensional datasets, there is a lag in the practical application of the theoretical results. This thesis is an expository study of the benefits of applying RMT when mining high dimensional genomic data, by empirically characterising Arabidopsis microarray datasets. Firstly, the eigenvalue spacing distribution of each dataset's sample gene correlation matrix is examined with a matrix thresholding procedure. The application of graph-based community detection methods reveals that each correlation matrix has a substantial portion with block diagonal structure. Thus the novel combination of the existing methodologies just described finds clusters that correspond to the block diagonal structure. The further examination of these clusters using RMT forms the basis for the second part of this thesis. A novel numerical procedure is then developed to estimate the dimension and dispersion of each cluster found, by comparing its spectrum against a one-signal model. This advance is one of the first practical applications of RMT that goes beyond a null hypothesis that the population covariance is either the identity matrix or a spiked identity model. Clusters are found to be nearly collinear, and more specific than those found by commonly used methods. The specificity also corresponds to stronger Gene Ontology (GO) term overrepresentation. Comparison
AB - The aim of genomic data analysis is to infer specific relationships amongst constituents of a complex system. The applied statistical methodology developed for this purpose relies heavily on the sample gene covariance matrix, which requires regularisation when the number of genes is larger than the sample size. However, the constituents' individual behaviours coalesce into a single macroscopic system. Random Matrix Theory (RMT) is well suited to describing complex systems when constituent interactions become too complicated to model, and is becoming increasingly important in covariance estimation. Despite the great insight RMT can bring to high dimensional datasets, there is a lag in the practical application of the theoretical results. This thesis is an expository study of the benefits of applying RMT when mining high dimensional genomic data, by empirically characterising Arabidopsis microarray datasets. Firstly, the eigenvalue spacing distribution of each dataset's sample gene correlation matrix is examined with a matrix thresholding procedure. The application of graph-based community detection methods reveals that each correlation matrix has a substantial portion with block diagonal structure. Thus the novel combination of the existing methodologies just described finds clusters that correspond to the block diagonal structure. The further examination of these clusters using RMT forms the basis for the second part of this thesis. A novel numerical procedure is then developed to estimate the dimension and dispersion of each cluster found, by comparing its spectrum against a one-signal model. This advance is one of the first practical applications of RMT that goes beyond a null hypothesis that the population covariance is either the identity matrix or a spiked identity model. Clusters are found to be nearly collinear, and more specific than those found by commonly used methods. The specificity also corresponds to stronger Gene Ontology (GO) term overrepresentation. Comparison
KW - Genomics
KW - Bioinformatics
KW - Analysis of covariance
KW - Collineation
KW - Random matrix theory
KW - Microarray data
KW - High dimensional inference
KW - Clustering
KW - Covariance estimation
M3 - Doctoral Thesis
ER -