Genome Biology (Jun 2025)
scExtract: leveraging large language models for fully automated single-cell RNA-seq data annotation and prior-informed multi-dataset integration
Abstract
Single-cell RNA sequencing has revolutionized cellular heterogeneity research, but analyzing the abundance of unannotated public datasets remains challenging. We present scExtract, a framework leveraging large language models to automate scRNA-seq data analysis from preprocessing to annotation and integration. scExtract extracts information from research articles to guide data processing, outperforming existing reference transfer methods in benchmarks. We introduce scanorama-prior and cellhint-prior, which incorporate prior annotation information for improved batch correction while preserving biological diversity. We demonstrate scExtract’s utility by integrating 14 datasets to create a comprehensive human skin atlas of 440,000 cells.
Keywords