TY - BOOK

T1 - Characterising the correlation structure of high dimensional genomic datasets using a random matrix theory approach

AU - Feher, Kristen

PY - 2010

Y1 - 2010

N2 - The aim of genomic data analysis is to infer specific relationships amongst constituents of a complex system. The applied statistical methodology developed for this purpose relies heavily on the sample gene covariance matrix, which requires regularisation when the number of genes is larger than the sample size. However, the constituents' individual behaviours coalesce into a single macroscopic system. Random Matrix Theory (RMT) is well suited to describing complex systems when constituent interactions become too complicated to model, and is becoming increasingly important in covariance estimation. Despite the great insight RMT can bring to high dimensional datasets, there is a lag in the practical application of the theoretical results. This thesis is an expository study of the benefits of applying RMT when mining high dimensional genomic data, by empirically characterising Arabidopsis microarray datasets. Firstly, the eigenvalue spacing distribution of each dataset's sample gene correlation matrix is examined with a matrix thresholding procedure. The application of graph-based community detection methods reveals that each correlation matrix has a substantial portion with block diagonal structure. Thus the novel combination of the existing methodologies just described finds clusters that correspond to the block diagonal structure. The further examination of these clusters using RMT forms the basis for the second part of this thesis. A novel numerical procedure is then developed to estimate the dimension and dispersion of each cluster found, by comparing its spectrum against a one-signal model. This advance is one of the first practical applications of RMT that goes beyond a null hypothesis that the population covariance is either the identity matrix or a spiked identity model. Clusters are found to be nearly collinear, and more specific than those found by commonly used methods. The specificity also corresponds to stronger Gene Ontology (GO) term overrepresentation. Comparison

AB - The aim of genomic data analysis is to infer specific relationships amongst constituents of a complex system. The applied statistical methodology developed for this purpose relies heavily on the sample gene covariance matrix, which requires regularisation when the number of genes is larger than the sample size. However, the constituents' individual behaviours coalesce into a single macroscopic system. Random Matrix Theory (RMT) is well suited to describing complex systems when constituent interactions become too complicated to model, and is becoming increasingly important in covariance estimation. Despite the great insight RMT can bring to high dimensional datasets, there is a lag in the practical application of the theoretical results. This thesis is an expository study of the benefits of applying RMT when mining high dimensional genomic data, by empirically characterising Arabidopsis microarray datasets. Firstly, the eigenvalue spacing distribution of each dataset's sample gene correlation matrix is examined with a matrix thresholding procedure. The application of graph-based community detection methods reveals that each correlation matrix has a substantial portion with block diagonal structure. Thus the novel combination of the existing methodologies just described finds clusters that correspond to the block diagonal structure. The further examination of these clusters using RMT forms the basis for the second part of this thesis. A novel numerical procedure is then developed to estimate the dimension and dispersion of each cluster found, by comparing its spectrum against a one-signal model. This advance is one of the first practical applications of RMT that goes beyond a null hypothesis that the population covariance is either the identity matrix or a spiked identity model. Clusters are found to be nearly collinear, and more specific than those found by commonly used methods. The specificity also corresponds to stronger Gene Ontology (GO) term overrepresentation. Comparison

KW - Genomics

KW - Bioinformatics

KW - Analysis of covariance

KW - Collineation

KW - Random matrix theory

KW - Microarray data

KW - High dimensional inference

KW - Clustering

KW - Covariance estimation

M3 - Doctoral Thesis

ER -