Assessing the impact of assemblers on virus detection in a de novo metagenomic analysis pipeline

Daniel J. White, Jing Wang, Richard J. Hall

Research output: Contribution to journal › Article

5 Citations (Scopus)

Abstract

Applying high-throughput sequencing to pathogen discovery is a relatively new field, the objective of which is to find disease-causing agents when little or no background information on disease is available. Key steps in the process are the generation of millions of sequence reads from an infected tissue sample, followed by assembly of these reads into longer, contiguous stretches of nucleotide sequences, and then identification of the contigs by matching them to known databases, such as those stored at GenBank or Ensembl. This technique, that is, de novo metagenomics, is particularly useful when the pathogen is viral and strong discriminatory power can be achieved. However, recently, we found that striking differences in results can be achieved when different assemblers were used. In this study, we test formally the impact of five popular assemblers (MIRA, VELVET, METAVELVET, SPADES, and OMEGA) on the detection of a novel virus and assembly of its whole genome in a data set for which we have confirmed the presence of the virus by empirical laboratory techniques, and compare the overall performance between assemblers. Our results show that if results from only one assembler are considered, biologically important reads can easily be overlooked. The impacts of these results on the field of pathogen discovery are considered.

Original language: English
Pages (from-to): 874-881
Number of pages: 8
Journal: Journal of Computational Biology
Volume: 24
Issue number: 9
DOI: 10.1089/cmb.2017.0008
Publication status: Published - 1 Sep 2017
Externally published: Yes

Fingerprint

Metagenomics
Pathogens
Viruses
Virus
Pipelines
Virus Assembly
Nucleic Acid Databases
Genome
Databases
Nucleotides
Stretch
Genes
Throughput
Tissue
Sequencing
High Throughput
Datasets

Cite this

@article{e0bd266b74814fa6aa7ecd53d81f281f,
title = "Assessing the impact of assemblers on virus detection in a de novo metagenomic analysis pipeline",
abstract = "Applying high-throughput sequencing to pathogen discovery is a relatively new field, the objective of which is to find disease-causing agents when little or no background information on disease is available. Key steps in the process are the generation of millions of sequence reads from an infected tissue sample, followed by assembly of these reads into longer, contiguous stretches of nucleotide sequences, and then identification of the contigs by matching them to known databases, such as those stored at GenBank or Ensembl. This technique, that is, de novo metagenomics, is particularly useful when the pathogen is viral and strong discriminatory power can be achieved. However, recently, we found that striking differences in results can be achieved when different assemblers were used. In this study, we test formally the impact of five popular assemblers (MIRA, VELVET, METAVELVET, SPADES, and OMEGA) on the detection of a novel virus and assembly of its whole genome in a data set for which we have confirmed the presence of the virus by empirical laboratory techniques, and compare the overall performance between assemblers. Our results show that if results from only one assembler are considered, biologically important reads can easily be overlooked. The impacts of these results on the field of pathogen discovery are considered.",
keywords = "algorithms, assemblers, de novo metagenomic, pathogen discovery, test",
author = "White, {Daniel J.} and Wang, Jing and Hall, {Richard J.}",
year = "2017",
month = "9",
day = "1",
doi = "10.1089/cmb.2017.0008",
language = "English",
volume = "24",
pages = "874--881",
journal = "Journal of Computational Biology",
issn = "1066-5277",
publisher = "Mary Ann Liebert Inc",
number = "9",

}

Assessing the impact of assemblers on virus detection in a de novo metagenomic analysis pipeline. / White, Daniel J.; Wang, Jing; Hall, Richard J.

In: Journal of Computational Biology, Vol. 24, No. 9, 01.09.2017, p. 874-881.

TY - JOUR

T1 - Assessing the impact of assemblers on virus detection in a de novo metagenomic analysis pipeline

AU - White, Daniel J.

AU - Wang, Jing

AU - Hall, Richard J.

PY - 2017/9/1

Y1 - 2017/9/1

N2 - Applying high-throughput sequencing to pathogen discovery is a relatively new field, the objective of which is to find disease-causing agents when little or no background information on disease is available. Key steps in the process are the generation of millions of sequence reads from an infected tissue sample, followed by assembly of these reads into longer, contiguous stretches of nucleotide sequences, and then identification of the contigs by matching them to known databases, such as those stored at GenBank or Ensembl. This technique, that is, de novo metagenomics, is particularly useful when the pathogen is viral and strong discriminatory power can be achieved. However, recently, we found that striking differences in results can be achieved when different assemblers were used. In this study, we test formally the impact of five popular assemblers (MIRA, VELVET, METAVELVET, SPADES, and OMEGA) on the detection of a novel virus and assembly of its whole genome in a data set for which we have confirmed the presence of the virus by empirical laboratory techniques, and compare the overall performance between assemblers. Our results show that if results from only one assembler are considered, biologically important reads can easily be overlooked. The impacts of these results on the field of pathogen discovery are considered.

AB - Applying high-throughput sequencing to pathogen discovery is a relatively new field, the objective of which is to find disease-causing agents when little or no background information on disease is available. Key steps in the process are the generation of millions of sequence reads from an infected tissue sample, followed by assembly of these reads into longer, contiguous stretches of nucleotide sequences, and then identification of the contigs by matching them to known databases, such as those stored at GenBank or Ensembl. This technique, that is, de novo metagenomics, is particularly useful when the pathogen is viral and strong discriminatory power can be achieved. However, recently, we found that striking differences in results can be achieved when different assemblers were used. In this study, we test formally the impact of five popular assemblers (MIRA, VELVET, METAVELVET, SPADES, and OMEGA) on the detection of a novel virus and assembly of its whole genome in a data set for which we have confirmed the presence of the virus by empirical laboratory techniques, and compare the overall performance between assemblers. Our results show that if results from only one assembler are considered, biologically important reads can easily be overlooked. The impacts of these results on the field of pathogen discovery are considered.

KW - algorithms

KW - assemblers

KW - de novo metagenomic

KW - pathogen discovery

KW - test

UR - http://www.scopus.com/inward/record.url?scp=85029376250&partnerID=8YFLogxK

U2 - 10.1089/cmb.2017.0008

DO - 10.1089/cmb.2017.0008

M3 - Article

VL - 24

SP - 874

EP - 881

JO - Journal of Computational Biology

JF - Journal of Computational Biology

SN - 1066-5277

IS - 9

ER -