Single-cell data combined with phenotypes improves variant interpretation

Timothy Chapman, Timo Lassmann

Research output: Contribution to journal › Article › peer-review

Abstract

Background: Whole genome sequencing offers significant potential to improve the diagnosis and treatment of rare diseases by enabling the identification of thousands of rare, potentially pathogenic variants. Existing variant prioritisation tools can be complemented by approaches that incorporate phenotype specificity and provide contextual biological information, such as tissue or cell-type specificity. We hypothesised that integrating single-cell gene expression data into phenotype-specific models would improve the accuracy and interpretability of pathogenic variant prioritisation.

Methods: To test this hypothesis, we developed IMPPROVE, a new tool that constructs phenotype-specific ensemble models integrating CADD scores with bulk and single-cell gene expression data. We constructed a total of 1,866 Random Forest models for individual HPO terms, incorporating both bulk and single-cell expression data.

Results: Our phenotype-specific models utilising expression data predicted pathogenic variants better in 90% of the phenotypes (HPO terms) considered. Using single-cell expression data instead of bulk data benefited the models, significantly shifting the proportion of pathogenic variants that were correctly identified at a fixed false positive rate (p < 10^-30, approximate Wilcoxon signed-rank test). For 57 phenotypes, model performance differed substantially depending on the dataset used. Further analysis revealed biological links between the pathology and the tissues or cell types used by these 57 models.

Conclusions: Phenotype-specific models that integrate gene expression data with CADD scores show great promise in improving variant prioritisation. In addition to improving diagnostic accuracy, these models offer insights into the underlying biological mechanisms of rare diseases. Enriching existing pathogenicity-related scores with gene expression datasets has the potential to advance personalised medicine through more accurate and interpretable variant prioritisation.
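The Results compare models by the proportion of pathogenic variants correctly identified at a fixed false positive rate. As a rough illustration of that metric only, and not IMPPROVE's actual implementation, the sketch below computes it for one phenotype from a list of model scores and labels. The function name, the toy scores and the 10% false positive rate are all assumptions made for the example; the abstract does not state which rate was used.

```python
from typing import Sequence


def tpr_at_fixed_fpr(scores: Sequence[float], labels: Sequence[int], max_fpr: float) -> float:
    """Largest true positive rate reachable while keeping FPR <= max_fpr.

    labels: 1 = pathogenic variant, 0 = benign variant.
    scores: higher means "more likely pathogenic".
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one pathogenic and one benign variant")

    # Rank variants from most to least likely pathogenic.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    tp = fp = 0
    best_tpr = 0.0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Variants with tied scores cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        if fp / n_neg > max_fpr:
            break  # lowering the threshold further would exceed the FPR budget
        best_tpr = tp / n_pos
    return best_tpr


# Hypothetical toy data for a single phenotype: 4 pathogenic, 10 benign variants.
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
# Made-up scores from a CADD-only model and from a model that also uses expression data.
cadd_only = [0.9, 0.4, 0.3, 0.8, 0.85, 0.5, 0.2, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]
with_expression = [0.9, 0.7, 0.6, 0.8, 0.65, 0.5, 0.2, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]

tpr_cadd = tpr_at_fixed_fpr(cadd_only, labels, max_fpr=0.1)
tpr_expr = tpr_at_fixed_fpr(with_expression, labels, max_fpr=0.1)
```

Computing this value once per phenotype for each of two model variants gives paired measurements, which is the kind of input a paired test such as the Wilcoxon signed-rank test operates on.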
Original language: English
Article number: 540
Number of pages: 16
Journal: BMC Genomics
Volume: 26
Issue number: 1
Publication status: Published - Dec 2025
