A comparison of Bayesian classification trees and random forest to identify classifiers for childhood leukaemia

R.A. O'Leary, R.W. Francis, Kim Carter, Martin Firth, Ursula Kees, Nicholas De Klerk

Research output: Chapter in Book/Conference paper › Conference paper › peer-review

Abstract

Recently, microarray technologies have been used extensively to distinguish gene expression in acute lymphoblastic leukaemia (ALL) (e.g. Pui et al., 2004; Hoffmann et al., 2008). ALL is the most common type of leukaemia diagnosed in children, with an incidence rate of about 4 per 100,000 per year (Pizzo and Poplack, 2001; Milne et al., 2008). There are six main subtypes of leukaemia, one of which is T-cell acute lymphoblastic leukaemia (T-ALL), which generally has lower cure rates than other forms of ALL. Ribonucleic acid (RNA) samples from each patient can be put onto microarrays to provide gene expression levels for around 20,000 genes (depending on which microarray chip is used). One of the challenges of microarray analysis in leukaemia research is identifying the smallest possible set of genes that predicts relapse with the highest predictive performance. Currently, one approach used to identify important differentially expressed genes is Random Forest (RF) (e.g. Hoffmann, 2006; Díaz-Uriarte and Alvarez de Andrés, 2006). RF is a classifier that consists of an ensemble of classification trees and yields the average class for each observation Y (each patient). Díaz-Uriarte and Alvarez de Andrés (2006) identified the characteristics that make RF ideal for microarray data. These include: RF can handle more variables than observations (large p small n problems); RF can be applied to binary and multi-class problems; RF has good predictive performance for datasets containing a large number of noise variables and does not overfit; RF can use both categorical and continuous predictors and investigates interactions; the results from RF are unaltered by monotone transformations of the variables; a free R library exists that performs RF; RF provides measures of variable importance; and, for the most part, one does not have to fine-tune parameters to obtain good predictive performance.
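To illustrate how an ensemble of classification trees can arrive at one class per patient, the following minimal Python sketch combines per-tree predictions by majority vote. It assumes majority voting as the aggregation rule, which is how random forests are usually described for classification; the gene names, thresholds and patient values are hypothetical and are not taken from the paper.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the class label predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


def forest_predict(trees, x):
    """Ask every tree for a class label, then aggregate by majority vote."""
    return majority_vote([tree(x) for tree in trees])


# Hypothetical one-split trees ("stumps"); each thresholds the expression
# level of a single made-up gene. Real trees would be grown from data.
trees = [
    lambda x: "relapse" if x["gene_a"] > 2.0 else "no relapse",
    lambda x: "relapse" if x["gene_b"] > 1.5 else "no relapse",
    lambda x: "relapse" if x["gene_c"] < 0.5 else "no relapse",
]

# Hypothetical expression values for one patient.
patient = {"gene_a": 2.4, "gene_b": 1.1, "gene_c": 0.3}

print(forest_predict(trees, patient))
```

In an actual random forest each tree is grown on a bootstrap sample of the patients, with a random subset of genes considered at each split; the sketch shows only the aggregation step.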
This paper describes an alternative approach to identifying a gene classifier for predicting relapse in ALL. Bayesian approaches to classification and regression trees (BCART) were proposed by Chipman et al. (1998), Denison et al. (1998) and Buntine (1992). BCART identifies “good” trees using a stochastic search algorithm that applies a reversible jump Markov chain Monte Carlo method. The best trees are then selected as those with the highest prediction accuracy (O'Leary et al., 2008). Fan and Gray (2005) gave BCART an A+ for interpretability and a B+ for prediction. To date, BCART has been largely based on “non-informative”, usually conjugate, priors. Moreover, there are only a few real-world applications of BCART (Lamon & Stow, 2004; Partridge et al., 2006; Schetinin et al., 2007). To the authors' knowledge, this statistical approach has not previously been applied to large p small n problems. Here we compare RF and BCART for predicting relapse in three ALL datasets, using gene expression values as the covariates. In all three datasets, the best tree identified by BCART had better accuracy, and in particular better prediction of relapse (higher sensitivity), than RF. BCART also performed better than RF in identifying important genes that predict whether a patient will relapse. © MODSIM 2009. All rights reserved.
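The two measures used in the comparison above, accuracy and sensitivity, can be computed from true and predicted labels as in the minimal Python sketch below. The labels are invented purely for illustration and do not reproduce any result from the paper; the function names are likewise assumptions.

```python
def confusion_counts(y_true, y_pred, positive="relapse"):
    """Count true positives, false negatives, true negatives, false positives."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    return tp, fn, tn, fp


def accuracy(tp, fn, tn, fp):
    """Proportion of all patients classified correctly."""
    return (tp + tn) / (tp + fn + tn + fp)


def sensitivity(tp, fn):
    """Proportion of relapsing patients correctly predicted to relapse."""
    return tp / (tp + fn)


# Illustrative labels only (not data from the study).
y_true = ["relapse", "relapse", "relapse",
          "no relapse", "no relapse", "no relapse", "no relapse", "no relapse"]
y_pred = ["relapse", "relapse", "no relapse",
          "no relapse", "no relapse", "no relapse", "relapse", "no relapse"]

tp, fn, tn, fp = confusion_counts(y_true, y_pred)
print("accuracy:", accuracy(tp, fn, tn, fp))
print("sensitivity:", sensitivity(tp, fn))
```

Sensitivity is reported separately because relapse is typically the rarer class, so a classifier can reach high accuracy while still missing most relapsing patients.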
Original language: English
Title of host publication: A comparison of Bayesian classification trees and random forest to identify classifiers for childhood leukaemia
Editors: R.S. Anderssen, R.D. Braddock, L.T.H. Newham
Place of Publication: Cairns, Australia
Publisher: Modelling and Simulation Society of Australia and New Zealand Inc.
Pages: 369-375
Volume: NA
Edition: Cairns, Australia
ISBN (Print): 9780975840078
Publication status: Published - 2009
Event: 18th World IMACS Congress and International Congress on Modelling and Simulation: Interfacing Modelling and Simulation with Mathematical and Computational Sciences, MODSIM 2009 - Cairns, Australia
Duration: 13 Jul 2009 to 17 Jul 2009
