scExtract: leveraging large language models for fully automated single-cell RNA-seq data annotation and prior-informed multi-dataset integration

Yuxuan Wu; Fuchou Tang

doi:10.1186/s13059-025-03639-x

Genome Biology (Jun 2025)

Affiliations

Yuxuan Wu: Biomedical Pioneering Innovation Center, School of Life Sciences, Peking University
Fuchou Tang: Biomedical Pioneering Innovation Center, School of Life Sciences, Peking University

Journal volume & issue: Vol. 26, no. 1, pp. 1–28

Abstract

Single-cell RNA sequencing has revolutionized cellular heterogeneity research, but analyzing the abundance of unannotated public datasets remains challenging. We present scExtract, a framework leveraging large language models to automate scRNA-seq data analysis from preprocessing to annotation and integration. scExtract extracts information from research articles to guide data processing, outperforming existing reference transfer methods in benchmarks. We introduce scanorama-prior and cellhint-prior, which incorporate prior annotation information for improved batch correction while preserving biological diversities. We demonstrate scExtract’s utility by integrating 14 datasets to create a comprehensive human skin atlas of 440,000 cells.

Published in Genome Biology

ISSN: 1474-760X (Online)
Publisher: BMC
Country of publisher: United Kingdom
LCC subjects: Science: Biology (General): Genetics
Website: https://genomebiology.biomedcentral.com/
