SampleExplorer: using language models to discover relevant transcriptome data

Wee Loong Chin, Timo Lassmann

    Research output: Contribution to journal › Article › peer-review

    Abstract

    Motivation: Over the last two decades, transcriptomics has become a standard technique in biomedical research. We now have large databases of RNA-seq data, accompanied by valuable metadata detailing the scientific objectives and experimental procedures used. This metadata is crucial for understanding and replicating published studies, but so far it has been underutilized in helping researchers discover existing datasets.

    Results: We present SampleExplorer, a tool that allows researchers to search for relevant data using both text and gene set queries. SampleExplorer embeds sample metadata and uses a transformer-based language model to retrieve similar datasets. Extensive benchmarking using the ARCHS4 database demonstrates that SampleExplorer is an effective and efficient approach for retrieving biologically relevant samples from large-scale transcriptomic data in large public repositories. It improves sample and dataset identification across diverse experimental contexts, helping researchers leverage existing transcriptomic data for potential replication or verification studies.

    Availability and implementation: SampleExplorer is a Python package compatible with Python versions 3.9 to 3.11 and can be installed from the Python Package Index (PyPI). The codebase and documentation are accessible at https://github.com/wlchin/SampleExplorer. The Supplementary Materials and Methods provide detailed methodological information, including an algorithmic description of the retrieval process and the data preparation steps.
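
The retrieval idea summarized in the abstract (embed sample metadata, then rank samples by similarity to a text query) can be illustrated with a minimal sketch. This is not SampleExplorer's API or the authors' method: the function names `embed`, `cosine` and `search` and the sample records are hypothetical, and a simple bag-of-words vector stands in for the transformer-based embedding so that the example runs with the standard library alone.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a language-model embedding: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query, metadata, top_k=2):
    """Rank samples by similarity between the query and their metadata text."""
    q = embed(query)
    scored = [(cosine(q, embed(text)), sample_id)
              for sample_id, text in metadata.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [(sample_id, round(score, 3)) for score, sample_id in scored[:top_k]]


# Hypothetical metadata records, invented for illustration only.
samples = {
    "sample_A": "lung tumour biopsy RNA-seq after immunotherapy",
    "sample_B": "mouse liver RNA-seq high fat diet",
    "sample_C": "melanoma tumour RNA-seq before immunotherapy",
}

for sample_id, score in search("tumour immunotherapy response", samples):
    print(sample_id, score)
```

In a real system the `embed` step would presumably call a language model, which lets queries match metadata that uses different wording for the same concept; the ranking step would stay conceptually the same.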
    Original language: English
    Article number: btae759
    Number of pages: 5
    Journal: Bioinformatics
    Volume: 41
    Issue number: 1
    Publication status: Published - 1 Jan 2025
