TY - JOUR
T1 - Nanopore sequencing data analysis using Microsoft Azure cloud computing service
AU - Truong, Linh
AU - Ayora, Felipe
AU - D'Orsogna, Lloyd
AU - Martinez, Patricia
AU - De Santis, Dianne
N1 - Funding Information:
This study was funded by an Innovation Grant (Microsoft Australia). The authors (LT, FA, DDS) were granted $35,000 AUD for the development of the automatic pipeline on a Microsoft Azure server. The sponsor (Microsoft Australia) played no role in the study design, data analysis or preparation of the manuscript. We would like to thank Mr Johnny Gorea for his valuable input in the conceptualization of this study. We are also grateful to all our colleagues at the Department of Clinical Immunology, PathWest for their technical assistance and troubleshooting.
Publisher Copyright:
Copyright: © 2022 Truong et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
PY - 2022/12
Y1 - 2022/12
N2 - Genetic information provides insights into the exome, genome, epigenetics and structural organisation of an organism. With the enormous amount of genetic information now available, scientists can undertake large-scale tasks that improve the standard of health care, such as determining genetic influences on the outcome of allogeneic transplantation. Cloud-based computing has increasingly become a key choice for many scientists, engineers and institutions because it offers on-demand network access and lets users rent, rather than buy, the computing resources they require. Given the advances in cloud computing and in nanopore sequencing data output, we were motivated to develop an automated and scalable analysis pipeline on Microsoft Azure cloud infrastructure to accelerate the HLA genotyping service and improve the efficiency of the workflow at lower cost. In this study, we describe (i) the selection of suitable virtual machine sizes to balance performance against cost-effectiveness; (ii) the building of Docker containers that include all tools in the cloud computational environment; and (iii) the comparison of HLA genotype concordance between the in-house manual method and the automated cloud-based pipeline to assess data accuracy. In conclusion, the Microsoft Azure cloud-based data analysis pipeline was shown to meet all the key imperatives for performance, cost, usability, simplicity and accuracy. Importantly, the pipeline allows for ongoing maintenance and testing of version changes before implementation. This pipeline is suitable for analysing data from the MinION sequencing platform and could be adopted for other data analysis applications.
UR - http://www.scopus.com/inward/record.url?scp=85143380017&partnerID=8YFLogxK
U2 - 10.1371/journal.pone.0278609
DO - 10.1371/journal.pone.0278609
M3 - Article
C2 - 36459531
AN - SCOPUS:85143380017
VL - 17
JO - PLoS One
JF - PLoS One
SN - 1932-6203
IS - 12
M1 - e0278609
ER -