TY - JOUR
T1 - EnsembleCNV
T2 - An ensemble machine learning algorithm to identify and genotype copy number variation using SNP array data
AU - Zhang, Zhongyang
AU - Cheng, Haoxiang
AU - Hong, Xiumei
AU - Di Narzo, Antonio F.
AU - Franzen, Oscar
AU - Peng, Shouneng
AU - Ruusalepp, Arno
AU - Kovacic, Jason C.
AU - Bjorkegren, Johan L.M.
AU - Wang, Xiaobin
AU - Hao, Ke
N1 - Funding Information:
National Institutes of Health [1R41DA042464-01, 1U01HD079068, 1R01ES029212-01, R01HL125863]; National Natural Science Foundation of China [91643201, 21876134]; Ministry of Science and Technology of China [2016YFC0206507]; Transatlantic Networks of Excellence Award from Foundation Leducq [12CVD02]. The FA study is supported in part by the Bunning Family Food Allergy Project/Food Allergy Initiative, Sacks Family Foundation Fund, Food Allergy Research and Education (FARE) and the National Institute of Allergy and Infectious Diseases [U01AI090727, R56AI080627 and R21AI088609, PI, Xiaobin Wang]. Funding for open access charge: Icahn School of Medicine at Mount Sinai Internal Research Fund. Conflict of interest statement. None declared.
Publisher Copyright:
© The Author(s) 2019. Published by Oxford University Press on behalf of Nucleic Acids Research.
PY - 2019/4/23
Y1 - 2019/4/23
AB - The associations between diseases/traits and copy number variants (CNVs) have not been systematically investigated in genome-wide association studies (GWASs), primarily due to a lack of robust and accurate tools for CNV genotyping. Herein, we propose a novel ensemble learning framework, ensembleCNV, to detect and genotype CNVs using single nucleotide polymorphism (SNP) array data. EnsembleCNV (a) identifies and eliminates batch effects at raw data level; (b) assembles individual CNV calls into CNV regions (CNVRs) from multiple existing callers with complementary strengths by a heuristic algorithm; (c) re-genotypes each CNVR with local likelihood model adjusted by global information across multiple CNVRs; (d) refines CNVR boundaries by local correlation structure in copy number intensities; (e) provides direct CNV genotyping accompanied with confidence score, directly accessible for downstream quality control and association analysis. Benchmarked on two large datasets, ensembleCNV outperformed competing methods and achieved a high call rate (93.3%) and reproducibility (98.6%), while concurrently achieving high sensitivity by capturing 85% of common CNVs documented in the 1000 Genomes Project. Given this CNV call rate and accuracy, which are comparable to SNP genotyping, we suggest ensembleCNV holds significant promise for performing genome-wide CNV association studies and investigating how CNVs predispose to human diseases.
UR - http://www.scopus.com/inward/record.url?scp=85064987913&partnerID=8YFLogxK
U2 - 10.1093/nar/gkz068
DO - 10.1093/nar/gkz068
M3 - Article
C2 - 30722045
AN - SCOPUS:85064987913
SN - 0305-1048
VL - 47
JO - Nucleic Acids Research
JF - Nucleic Acids Research
IS - 7
M1 - e39
ER -