Comparative performances of machine learning methods for classifying Crohn Disease patients using genome-wide genotyping data

International Inflammatory Bowel Disease Genetics Consortium (IIBDGC)

Research output: Contribution to journal › Article

Abstract

Crohn Disease (CD) is a complex genetic disorder for which more than 140 genes have been identified using genome-wide association studies (GWAS). However, the genetic architecture of the trait remains largely unknown. The recent development of machine learning (ML) approaches prompted us to apply them to classify healthy and diseased people according to their genomic information. The Immunochip dataset, containing 18,227 CD patients and 34,050 healthy controls enrolled and genotyped by the International Inflammatory Bowel Disease Genetics Consortium (IIBDGC), was re-analyzed using a set of ML methods: penalized logistic regression (LR), gradient boosted trees (GBT) and artificial neural networks (NN). The main score used to compare the methods was the Area Under the ROC Curve (AUC) statistic. An analysis of the impact of quality control (QC), imputation and coding methods on LR results showed that QC methods and imputation of missing genotypes may artificially increase the scores. By contrast, neither the patient/control ratio nor marker preselection or coding strategies significantly affected the results. LR methods, including Lasso, Ridge and ElasticNet, provided similar results, with a maximum AUC of 0.80. GBT methods such as XGBoost, LightGBM and CatBoost, together with dense NNs with one or more hidden layers, provided similar AUC values, suggesting limited epistatic effects in the genetic architecture of the trait. ML methods detected nearly all the genetic variants previously identified by GWAS among the best predictors, plus additional predictors with lower effects. The robustness and complementarity of the different methods were also studied. Compared to LR, non-linear models such as GBT or NN may provide robust complementary approaches to identify and classify genetic markers.
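
The abstract names the AUC as the main comparison score. As an illustration only, the sketch below computes an AUC from case/control labels and classifier scores using the rank-sum (Mann-Whitney) formulation. The function name `roc_auc` and the toy data are assumptions made for this example; they are not taken from the paper and do not represent the authors' pipeline.

```python
def roc_auc(labels, scores):
    """Return the area under the ROC curve.

    Uses the rank-sum identity: the AUC equals the probability that a
    randomly chosen case (label 1) scores higher than a randomly chosen
    control (label 0), with ties counted as one half.
    """
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")

    # Indices sorted by ascending score.
    order = sorted(range(len(scores)), key=lambda i: scores[i])

    # Assign 1-based ranks, averaging ranks within groups of tied scores.
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1

    n_cases = sum(1 for y in labels if y == 1)
    n_controls = len(labels) - n_cases
    if n_cases == 0 or n_controls == 0:
        raise ValueError("AUC needs at least one case and one control")

    # Mann-Whitney U statistic for the cases, normalised to [0, 1].
    rank_sum_cases = sum(r for r, y in zip(ranks, labels) if y == 1)
    u_cases = rank_sum_cases - n_cases * (n_cases + 1) / 2
    return u_cases / (n_cases * n_controls)


# Hypothetical toy data: 1 = patient, 0 = control.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(toy_labels, toy_scores))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of cases from controls, which is the scale on which the reported maximum of 0.80 should be read.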

Original language: English
Article number: 10351
Journal: Scientific Reports
Volume: 9
Issue number: 1
DOIs: https://doi.org/10.1038/s41598-019-46649-z
Publication status: Published - 17 Jul 2019

Fingerprint

Crohn Disease
Genome
Logistic Models
ROC Curve
Area Under Curve
Genome-Wide Association Study
Quality Control
Inborn Genetic Diseases
Nonlinear Dynamics
Machine Learning
Genetic Markers
Inflammatory Bowel Diseases
Genotype
Genes

Cite this

International Inflammatory Bowel Disease Genetics Consortium (IIBDGC). / Comparative performances of machine learning methods for classifying Crohn Disease patients using genome-wide genotyping data. In: Scientific Reports. 2019 ; Vol. 9, No. 1.
@article{ba49e98da73647d39a28e264aac0db90,
title = "Comparative performances of machine learning methods for classifying Crohn Disease patients using genome-wide genotyping data",
abstract = "Crohn Disease (CD) is a complex genetic disorder for which more than 140 genes have been identified using genome-wide association studies (GWAS). However, the genetic architecture of the trait remains largely unknown. The recent development of machine learning (ML) approaches prompted us to apply them to classify healthy and diseased people according to their genomic information. The Immunochip dataset, containing 18,227 CD patients and 34,050 healthy controls enrolled and genotyped by the International Inflammatory Bowel Disease Genetics Consortium (IIBDGC), was re-analyzed using a set of ML methods: penalized logistic regression (LR), gradient boosted trees (GBT) and artificial neural networks (NN). The main score used to compare the methods was the Area Under the ROC Curve (AUC) statistic. An analysis of the impact of quality control (QC), imputation and coding methods on LR results showed that QC methods and imputation of missing genotypes may artificially increase the scores. By contrast, neither the patient/control ratio nor marker preselection or coding strategies significantly affected the results. LR methods, including Lasso, Ridge and ElasticNet, provided similar results, with a maximum AUC of 0.80. GBT methods such as XGBoost, LightGBM and CatBoost, together with dense NNs with one or more hidden layers, provided similar AUC values, suggesting limited epistatic effects in the genetic architecture of the trait. ML methods detected nearly all the genetic variants previously identified by GWAS among the best predictors, plus additional predictors with lower effects. The robustness and complementarity of the different methods were also studied. Compared to LR, non-linear models such as GBT or NN may provide robust complementary approaches to identify and classify genetic markers.",
author = "{International Inflammatory Bowel Disease Genetics Consortium (IIBDGC)} and Alberto Romagnoni and Simon J{\'e}gou and {Van Steen}, Kristel and Gilles Wainrib and Hugot, {Jean Pierre} and Laurent Peyrin-Biroulet and Mathias Chamaillard and Colombel, {Jean Frederick} and Mario Cottone and Mauro D’Amato and Renata D’Inc{\`a} and Jonas Halfvarson and Paul Henderson and Amir Karban and Kennedy, {Nicholas A.} and Khan, {Mohammed Azam} and Marc L{\'e}mann and Arie Levine and Dunecan Massey and Monica Milla and Ng, {Sok Meng} and Ioannis Oikonomou and Harald Peeters and Proctor, {Deborah D.} and Rahier, {Jean Francois} and Paul Rutgeerts and Frank Seibold and Laura Stronati and Taylor, {Kirstin M.} and Leif T{\"o}rkvist and Kullak Ublick and {Van Limbergen}, Johan and {Van Gossum}, Andre and Vatn, {Morten H.} and Hu Zhang and Wei Zhang and Andrews, {Jane M.} and Bampton, {Peter A.} and Murray Barclay and Florin, {Timothy H.} and Richard Gearry and Krupa Krishnaprasad and Lawrance, {Ian C.} and Gillian Mahy and Montgomery, {Grant W.} and Graham Radford-Smith and Roberts, {Rebecca L.} and Simms, {Lisa A.} and Katherine Hanigan and Anthony Croft and Leila Amininijad and Isabelle Cleynen and Olivier Dewit and Denis Franchimont and Michel Georges and Debby Laukens and Harald Peeters and Rahier, {Jean Francois} and Paul Rutgeerts and Emilie Theatre and {Van Gossum}, Andr{\'e} and Severine Vermeire and Guy Aumais and Leonard Baidoo and Barrie, {Arthur M.} and Karen Beck and Bernard, {Edmond Jean} and Binion, {David G.} and Alain Bitton and Brant, {Steve R.} and Cho, {Judy H.} and Albert Cohen and Kenneth Croitoru and Daly, {Mark J.} and Datta, {Lisa W.} and Colette Deslandres and Duerr, {Richard H.} and Debra Dutridge and John Ferguson and Joann Fultz and Philippe Goyette and Greenberg, {Gordon R.} and Talin Haritunians and Gilles Jobin and Seymour Katz and Lahaie, {Raymond G.} and McGovern, {Dermot P.} and Linda Nelson and Ng, {Sok Meng} and Kaida Ning and Ioannis 
Oikonomou and Pierre Par{\'e} and Proctor, {Deborah D.} and Regueiro, {Miguel D.} and Rioux, {John D.} and Elizabeth Ruggiero and Schumm, {L. Philip} and Marc Schwartz and Regan Scott and Yashoda Sharma and Silverberg, {Mark S.} and Denise Spears and Steinhart, {A. Hillary} and Stempak, {Joanne M.} and Swoger, {Jason M.} and Constantina Tsagarelis and Wei Zhang and Clarence Zhang and Hongyu Zhao and Jan Aerts and Tariq Ahmad and Hazel Arbury and Anthony Attwood and Adam Auton and Ball, {Stephen G.} and Balmforth, {Anthony J.} and Chris Barnes and Barrett, {Jeffrey C.} and In{\^e}s Barroso and Anne Barton and Bennett, {Amanda J.} and Sanjeev Bhaskar and Katarzyna Blaszczyk and John Bowes and Brand, {Oliver J.} and Braund, {Peter S.} and Francesca Bredin and Gerome Breen and Brown, {Morris J.} and Bruce, {Ian N.} and Jaswinder Bull and Burren, {Oliver S.} and John Burton and Jake Byrnes and Sian Caesar and Niall Cardin and Clee, {Chris M.} and Coffey, {Alison J.} and {MC Connell}, John and Conrad, {Donald F.} and Cooper, {Jason D.} and Dominiczak, {Anna F.} and Kate Downes and Drummond, {Hazel E.} and Darshna Dudakia and Andrew Dunham and Bernadette Ebbs and Diana Eccles and Sarah Edkins and Cathryn Edwards and Anna Elliot and Paul Emery and Evans, {David M.} and Gareth Evans and Steve Eyre and Anne Farmer and Ferrier, {I. 
Nicol} and Edward Flynn and Alistair Forbes and Liz Forty and Franklyn, {Jayne A.} and Frayling, {Timothy M.} and Freathy, {Rachel M.} and Eleni Giannoulatou and Polly Gibbs and Paul Gilbert and Katherine Gordon-Smith and Emma Gray and Elaine Green and Groves, {Chris J.} and Detelina Grozeva and Rhian Gwilliam and Anita Hall and Naomi Hammond and Matt Hardy and Pile Harrison and Neelam Hassanali and Husam Hebaishi and Sarah Hines and Anne Hinks and Hitman, {Graham A.} and Lynne Hocking and Chris Holmes and Eleanor Howard and Philip Howard and Howson, {Joanna M.M.} and Debbie Hughes and Sarah Hunt and Isaacs, {John D.} and Mahim Jain and Jewell, {Derek P.} and Toby Johnson and Jolley, {Jennifer D.} and Jones, {Ian R.} and Jones, {Lisa A.} and George Kirov and Langford, {Cordelia F.} and Hana Lango-Allen and Lathrop, {G. Mark} and James Lee and Lee, {Kate L.} and Charlie Lees and Kevin Lewis and Lindgren, {Cecilia M.} and Meeta Maisuria-Armer and Julian Maller and John Mansfield and Marchini, {Jonathan L.} and Paul Martin and Massey, {Dunecan Co} and McArdle, {Wendy L.} and Peter McGuffin and McLay, {Kirsten E.} and Gil McVean and Alex Mentzer and Mimmack, {Michael L.} and Morgan, {Ann E.} and Morris, {Andrew P.} and Craig Mowat and Munroe, {Patricia B.} and Simon Myers and William Newman and Nimmo, {Elaine R.} and O’Donovan, {Michael C.} and Abiodun Onipinla and Ovington, {Nigel R.} and Owen, {Michael J.} and Kimmo Palin and Aarno Palotie and Kirstie Parnell and Richard Pearson and David Pernet and Perry, {John Rb} and Anne Phillips and Vincent Plagnol and Prescott, {Natalie J.} and Inga Prokopenko and Quail, {Michael A.} and Suzanne Rafelt and Rayner, {Nigel W.} and Reid, {David M.} and Anthony Renwick and Wendy Thomson and Brown, {Matthew A.} and Burton, {Paul R.} and Hall, {Alistair S.} and Blackwell, {Jenefer M.} and Wood, {Nicholas W.} and Spencer, {Chris C.A.}",
year = "2019",
month = "7",
day = "17",
doi = "10.1038/s41598-019-46649-z",
language = "English",
volume = "9",
journal = "Scientific Reports",
issn = "2045-2322",
publisher = "Nature Publishing Group - Macmillan Publishers",
number = "1",

}

Comparative performances of machine learning methods for classifying Crohn Disease patients using genome-wide genotyping data. / International Inflammatory Bowel Disease Genetics Consortium (IIBDGC).

In: Scientific Reports, Vol. 9, No. 1, 10351, 17.07.2019.

Research output: Contribution to journal › Article

TY - JOUR

T1 - Comparative performances of machine learning methods for classifying Crohn Disease patients using genome-wide genotyping data

AU - International Inflammatory Bowel Disease Genetics Consortium (IIBDGC)

AU - Romagnoni, Alberto

AU - Jégou, Simon

AU - Van Steen, Kristel

AU - Wainrib, Gilles

AU - Hugot, Jean Pierre

AU - Peyrin-Biroulet, Laurent

AU - Chamaillard, Mathias

AU - Colombel, Jean Frederick

AU - Cottone, Mario

AU - D’Amato, Mauro

AU - D’Incà, Renata

AU - Halfvarson, Jonas

AU - Henderson, Paul

AU - Karban, Amir

AU - Kennedy, Nicholas A.

AU - Khan, Mohammed Azam

AU - Lémann, Marc

AU - Levine, Arie

AU - Massey, Dunecan

AU - Milla, Monica

AU - Ng, Sok Meng

AU - Oikonomou, Ioannis

AU - Peeters, Harald

AU - Proctor, Deborah D.

AU - Rahier, Jean Francois

AU - Rutgeerts, Paul

AU - Seibold, Frank

AU - Stronati, Laura

AU - Taylor, Kirstin M.

AU - Törkvist, Leif

AU - Ublick, Kullak

AU - Van Limbergen, Johan

AU - Van Gossum, Andre

AU - Vatn, Morten H.

AU - Zhang, Hu

AU - Zhang, Wei

AU - Andrews, Jane M.

AU - Bampton, Peter A.

AU - Barclay, Murray

AU - Florin, Timothy H.

AU - Gearry, Richard

AU - Krishnaprasad, Krupa

AU - Lawrance, Ian C.

AU - Mahy, Gillian

AU - Montgomery, Grant W.

AU - Radford-Smith, Graham

AU - Roberts, Rebecca L.

AU - Simms, Lisa A.

AU - Hanigan, Katherine

AU - Croft, Anthony

AU - Amininijad, Leila

AU - Cleynen, Isabelle

AU - Dewit, Olivier

AU - Franchimont, Denis

AU - Georges, Michel

AU - Laukens, Debby

AU - Peeters, Harald

AU - Rahier, Jean Francois

AU - Rutgeerts, Paul

AU - Theatre, Emilie

AU - Van Gossum, André

AU - Vermeire, Severine

AU - Aumais, Guy

AU - Baidoo, Leonard

AU - Barrie, Arthur M.

AU - Beck, Karen

AU - Bernard, Edmond Jean

AU - Binion, David G.

AU - Bitton, Alain

AU - Brant, Steve R.

AU - Cho, Judy H.

AU - Cohen, Albert

AU - Croitoru, Kenneth

AU - Daly, Mark J.

AU - Datta, Lisa W.

AU - Deslandres, Colette

AU - Duerr, Richard H.

AU - Dutridge, Debra

AU - Ferguson, John

AU - Fultz, Joann

AU - Goyette, Philippe

AU - Greenberg, Gordon R.

AU - Haritunians, Talin

AU - Jobin, Gilles

AU - Katz, Seymour

AU - Lahaie, Raymond G.

AU - McGovern, Dermot P.

AU - Nelson, Linda

AU - Ng, Sok Meng

AU - Ning, Kaida

AU - Oikonomou, Ioannis

AU - Paré, Pierre

AU - Proctor, Deborah D.

AU - Regueiro, Miguel D.

AU - Rioux, John D.

AU - Ruggiero, Elizabeth

AU - Schumm, L. Philip

AU - Schwartz, Marc

AU - Scott, Regan

AU - Sharma, Yashoda

AU - Silverberg, Mark S.

AU - Spears, Denise

AU - Steinhart, A. Hillary

AU - Stempak, Joanne M.

AU - Swoger, Jason M.

AU - Tsagarelis, Constantina

AU - Zhang, Wei

AU - Zhang, Clarence

AU - Zhao, Hongyu

AU - Aerts, Jan

AU - Ahmad, Tariq

AU - Arbury, Hazel

AU - Attwood, Anthony

AU - Auton, Adam

AU - Ball, Stephen G.

AU - Balmforth, Anthony J.

AU - Barnes, Chris

AU - Barrett, Jeffrey C.

AU - Barroso, Inês

AU - Barton, Anne

AU - Bennett, Amanda J.

AU - Bhaskar, Sanjeev

AU - Blaszczyk, Katarzyna

AU - Bowes, John

AU - Brand, Oliver J.

AU - Braund, Peter S.

AU - Bredin, Francesca

AU - Breen, Gerome

AU - Brown, Morris J.

AU - Bruce, Ian N.

AU - Bull, Jaswinder

AU - Burren, Oliver S.

AU - Burton, John

AU - Byrnes, Jake

AU - Caesar, Sian

AU - Cardin, Niall

AU - Clee, Chris M.

AU - Coffey, Alison J.

AU - MC Connell, John

AU - Conrad, Donald F.

AU - Cooper, Jason D.

AU - Dominiczak, Anna F.

AU - Downes, Kate

AU - Drummond, Hazel E.

AU - Dudakia, Darshna

AU - Dunham, Andrew

AU - Ebbs, Bernadette

AU - Eccles, Diana

AU - Edkins, Sarah

AU - Edwards, Cathryn

AU - Elliot, Anna

AU - Emery, Paul

AU - Evans, David M.

AU - Evans, Gareth

AU - Eyre, Steve

AU - Farmer, Anne

AU - Ferrier, I. Nicol

AU - Flynn, Edward

AU - Forbes, Alistair

AU - Forty, Liz

AU - Franklyn, Jayne A.

AU - Frayling, Timothy M.

AU - Freathy, Rachel M.

AU - Giannoulatou, Eleni

AU - Gibbs, Polly

AU - Gilbert, Paul

AU - Gordon-Smith, Katherine

AU - Gray, Emma

AU - Green, Elaine

AU - Groves, Chris J.

AU - Grozeva, Detelina

AU - Gwilliam, Rhian

AU - Hall, Anita

AU - Hammond, Naomi

AU - Hardy, Matt

AU - Harrison, Pile

AU - Hassanali, Neelam

AU - Hebaishi, Husam

AU - Hines, Sarah

AU - Hinks, Anne

AU - Hitman, Graham A.

AU - Hocking, Lynne

AU - Holmes, Chris

AU - Howard, Eleanor

AU - Howard, Philip

AU - Howson, Joanna M.M.

AU - Hughes, Debbie

AU - Hunt, Sarah

AU - Isaacs, John D.

AU - Jain, Mahim

AU - Jewell, Derek P.

AU - Johnson, Toby

AU - Jolley, Jennifer D.

AU - Jones, Ian R.

AU - Jones, Lisa A.

AU - Kirov, George

AU - Langford, Cordelia F.

AU - Lango-Allen, Hana

AU - Lathrop, G. Mark

AU - Lee, James

AU - Lee, Kate L.

AU - Lees, Charlie

AU - Lewis, Kevin

AU - Lindgren, Cecilia M.

AU - Maisuria-Armer, Meeta

AU - Maller, Julian

AU - Mansfield, John

AU - Marchini, Jonathan L.

AU - Martin, Paul

AU - Massey, Dunecan Co

AU - McArdle, Wendy L.

AU - McGuffin, Peter

AU - McLay, Kirsten E.

AU - McVean, Gil

AU - Mentzer, Alex

AU - Mimmack, Michael L.

AU - Morgan, Ann E.

AU - Morris, Andrew P.

AU - Mowat, Craig

AU - Munroe, Patricia B.

AU - Myers, Simon

AU - Newman, William

AU - Nimmo, Elaine R.

AU - O’Donovan, Michael C.

AU - Onipinla, Abiodun

AU - Ovington, Nigel R.

AU - Owen, Michael J.

AU - Palin, Kimmo

AU - Palotie, Aarno

AU - Parnell, Kirstie

AU - Pearson, Richard

AU - Pernet, David

AU - Perry, John Rb

AU - Phillips, Anne

AU - Plagnol, Vincent

AU - Prescott, Natalie J.

AU - Prokopenko, Inga

AU - Quail, Michael A.

AU - Rafelt, Suzanne

AU - Rayner, Nigel W.

AU - Reid, David M.

AU - Renwick, Anthony

AU - Thomson, Wendy

AU - Brown, Matthew A.

AU - Burton, Paul R.

AU - Hall, Alistair S.

AU - Blackwell, Jenefer M.

AU - Wood, Nicholas W.

AU - Spencer, Chris C.A.

PY - 2019/7/17

Y1 - 2019/7/17

N2 - Crohn Disease (CD) is a complex genetic disorder for which more than 140 genes have been identified using genome-wide association studies (GWAS). However, the genetic architecture of the trait remains largely unknown. The recent development of machine learning (ML) approaches prompted us to apply them to classify healthy and diseased people according to their genomic information. The Immunochip dataset, containing 18,227 CD patients and 34,050 healthy controls enrolled and genotyped by the International Inflammatory Bowel Disease Genetics Consortium (IIBDGC), was re-analyzed using a set of ML methods: penalized logistic regression (LR), gradient boosted trees (GBT) and artificial neural networks (NN). The main score used to compare the methods was the Area Under the ROC Curve (AUC) statistic. An analysis of the impact of quality control (QC), imputation and coding methods on LR results showed that QC methods and imputation of missing genotypes may artificially increase the scores. By contrast, neither the patient/control ratio nor marker preselection or coding strategies significantly affected the results. LR methods, including Lasso, Ridge and ElasticNet, provided similar results, with a maximum AUC of 0.80. GBT methods such as XGBoost, LightGBM and CatBoost, together with dense NNs with one or more hidden layers, provided similar AUC values, suggesting limited epistatic effects in the genetic architecture of the trait. ML methods detected nearly all the genetic variants previously identified by GWAS among the best predictors, plus additional predictors with lower effects. The robustness and complementarity of the different methods were also studied. Compared to LR, non-linear models such as GBT or NN may provide robust complementary approaches to identify and classify genetic markers.

AB - Crohn Disease (CD) is a complex genetic disorder for which more than 140 genes have been identified using genome-wide association studies (GWAS). However, the genetic architecture of the trait remains largely unknown. The recent development of machine learning (ML) approaches prompted us to apply them to classify healthy and diseased people according to their genomic information. The Immunochip dataset, containing 18,227 CD patients and 34,050 healthy controls enrolled and genotyped by the International Inflammatory Bowel Disease Genetics Consortium (IIBDGC), was re-analyzed using a set of ML methods: penalized logistic regression (LR), gradient boosted trees (GBT) and artificial neural networks (NN). The main score used to compare the methods was the Area Under the ROC Curve (AUC) statistic. An analysis of the impact of quality control (QC), imputation and coding methods on LR results showed that QC methods and imputation of missing genotypes may artificially increase the scores. By contrast, neither the patient/control ratio nor marker preselection or coding strategies significantly affected the results. LR methods, including Lasso, Ridge and ElasticNet, provided similar results, with a maximum AUC of 0.80. GBT methods such as XGBoost, LightGBM and CatBoost, together with dense NNs with one or more hidden layers, provided similar AUC values, suggesting limited epistatic effects in the genetic architecture of the trait. ML methods detected nearly all the genetic variants previously identified by GWAS among the best predictors, plus additional predictors with lower effects. The robustness and complementarity of the different methods were also studied. Compared to LR, non-linear models such as GBT or NN may provide robust complementary approaches to identify and classify genetic markers.

UR - http://www.scopus.com/inward/record.url?scp=85069470428&partnerID=8YFLogxK

U2 - 10.1038/s41598-019-46649-z

DO - 10.1038/s41598-019-46649-z

M3 - Article

VL - 9

JO - Scientific Reports

JF - Scientific Reports

SN - 2045-2322

IS - 1

M1 - 10351

ER -