Genome Biology (Jun 2025)

scExtract: leveraging large language models for fully automated single-cell RNA-seq data annotation and prior-informed multi-dataset integration

  • Yuxuan Wu
  • Fuchou Tang

DOI
https://doi.org/10.1186/s13059-025-03639-x
Journal volume & issue
Vol. 26, no. 1
pp. 1 – 28

Abstract

Single-cell RNA sequencing has revolutionized the study of cellular heterogeneity, but analyzing the large number of unannotated public datasets remains challenging. We present scExtract, a framework that leverages large language models to automate scRNA-seq data analysis from preprocessing through annotation and integration. scExtract extracts information from research articles to guide data processing and outperforms existing reference transfer methods in benchmarks. We introduce scanorama-prior and cellhint-prior, which incorporate prior annotation information to improve batch correction while preserving biological diversity. We demonstrate scExtract’s utility by integrating 14 datasets to create a comprehensive human skin atlas of 440,000 cells.

Keywords