A probabilistic graphical model for estimating selection coefficients of nonsynonymous variants from human population sequence data

Yige Zhao; Tian Lan; Guojie Zhong; Jake Hagen; Hongbing Pan; Wendy K. Chung; Yufeng Shen

doi:10.1038/s41467-025-59937-2

Nature Communications (May 2025)

Affiliations

Yige Zhao: Department of Systems Biology, Columbia University Irving Medical Center
Tian Lan: Department of Systems Biology, Columbia University Irving Medical Center
Guojie Zhong: Department of Systems Biology, Columbia University Irving Medical Center
Jake Hagen: Department of Systems Biology, Columbia University Irving Medical Center
Hongbing Pan: Department of Biomedical Informatics, Columbia University Irving Medical Center
Wendy K. Chung: Department of Pediatrics, Boston Children’s Hospital and Harvard Medical School
Yufeng Shen: Department of Systems Biology, Columbia University Irving Medical Center

Journal volume & issue: Vol. 16, no. 1
pp. 1 – 12

Abstract

Accurately predicting the effect of missense variants is important in discovering disease risk genes and clinical genetic diagnostics. Commonly used computational methods predict pathogenicity, which does not capture the quantitative impact on fitness in humans. We develop a method, MisFit, to estimate missense fitness effect using a graphical model. MisFit jointly models the effect at a molecular level ($d$) and a population level (selection coefficient, $s$), assuming that in the same gene, missense variants with similar $d$ have similar $s$. We train it by maximizing probability of observed allele counts in 236,017 individuals of European ancestry. We show that $s$ is informative in predicting allele frequency across ancestries and consistent with the fraction of de novo mutations in sites under strong selection. Further, $s$ outperforms previous methods in prioritizing de novo missense variants in individuals with neurodevelopmental disorders. In conclusion, MisFit accurately predicts $s$ and yields new insights from genomic data.

Published in Nature Communications

ISSN: 2041-1723 (Online)
Publisher: Nature Portfolio
Country of publisher: United Kingdom
LCC subjects: Science
Website: https://www.nature.com/ncomms/
