TY - JOUR
T1 - DNABERT-based explainable lncRNA identification in plant genome assemblies
AU - Danilevicz, Monica F.
AU - Gill, Mitchell
AU - Tay Fernandez, Cassandria G.
AU - Petereit, Jakob
AU - Upadhyaya, Shriprabha R.
AU - Batley, Jacqueline
AU - Bennamoun, Mohammed
AU - Edwards, David
AU - Bayer, Philipp E.
N1 - Funding Information:
This work was supported by resources provided by the Pawsey Supercomputing Research Centre with funding from the Australian Government and the Government of Western Australia. The Australian Government supported this work through the Australian Research Council (Projects DP210100296, DP200100762, and DE210100398) and the Grains Research and Development Corporation (Projects 9177539 and 9177591). Monica F. Danilevicz, Mitchell Gill and Cassandria G. Tay Fernandez were supported by the Research Training Program scholarship. Shriprabha Upadhyaya was supported by the Ad Hoc Postgraduate Scholarship. The Forrest Research Foundation further supported Monica F. Danilevicz.
Publisher Copyright:
© 2023 The Authors
PY - 2023/1
Y1 - 2023/1
AB - Long non-coding ribonucleic acids (lncRNAs) have been shown to play an important role in plant gene regulation, involving both epigenetic and transcript regulation. LncRNAs are transcripts longer than 200 nucleotides that are not translated into functional proteins but can be translated into small peptides. Machine learning models have predominantly used transcriptome data with manually defined features to detect lncRNAs; however, they often underrepresent the abundance of lncRNAs and can be biased in their detection. Here we present a study using Natural Language Processing (NLP) models to identify plant lncRNAs from genomic sequences rather than transcriptomic data. The NLP models were trained to predict lncRNAs for seven model and crop species (Zea mays, Arabidopsis thaliana, Brassica napus, Brassica oleracea, Brassica rapa, Glycine max and Oryza sativa) using publicly available genomic references. We demonstrated that lncRNAs can be accurately predicted from genomic sequences, with the highest accuracy of 83.4% for Z. mays and the lowest accuracy of 57.9% for B. rapa, suggesting that genome assembly quality may affect the accuracy of lncRNA identification. Furthermore, we demonstrated the potential of using NLP models for cross-species prediction, with an average accuracy of 63.1% on target species not previously seen by the model. As more species are incorporated into the training datasets, we expect the accuracy to increase, making the models a more reliable tool for uncovering novel lncRNAs. Finally, we show that the models can be interpreted using explainable artificial intelligence to identify motifs important to lncRNA prediction and that these motifs frequently flanked the lncRNA sequence.
KW - Cross-species prediction
KW - Deep learning
KW - Genomic motif
KW - LncRNAs
KW - Natural language processing
UR - http://www.scopus.com/inward/record.url?scp=85177612244&partnerID=8YFLogxK
U2 - 10.1016/j.csbj.2023.11.025
DO - 10.1016/j.csbj.2023.11.025
M3 - Article
C2 - 38058296
AN - SCOPUS:85177612244
SN - 2001-0370
VL - 21
SP - 5676
EP - 5685
JO - Computational and Structural Biotechnology Journal
JF - Computational and Structural Biotechnology Journal
ER -