Karim, Md Rezaul, Cochez, Michael ORCID: 0000-0001-5726-4638, Zappa, Achille ORCID: 0000-0003-4040-9620, Sahay, Ratnesh, Rebholz-Schuhmann, Dietrich, Beyan, Oya and Decker, Stefan (2022). Convolutional Embedded Networks for Population Scale Clustering and Bio-Ancestry Inferencing. IEEE-ACM Trans. Comput. Biol. Bioinform., 19 (1). pp. 369 - 383. LOS ALAMITOS: IEEE COMPUTER SOC. ISSN 1557-9964

Full text not available from this repository.

Abstract

The study of genetic variants (GVs) can help find correlating population groups, identify cohorts that are predisposed to common diseases, and explain differences in disease susceptibility and in how patients react to drugs. Machine learning techniques are increasingly being applied to identify interacting GVs and to understand their complex phenotypic traits. Since the performance of a learning algorithm depends not only on the size and nature of the data but also on the quality of the underlying representation, deep neural networks (DNNs) can learn non-linear mappings that transform GV data into representations more amenable to clustering and classification than those obtained by manual feature selection. In this paper, we propose convolutional embedded networks (CEN), in which we combine two DNN architectures, convolutional embedded clustering (CEC) and a convolutional autoencoder (CAE) classifier, for clustering individuals and predicting geographic ethnicity based on GVs, respectively. We applied CAE-based representation learning to 95 million GVs from the '1000 genomes' (covering 2,504 individuals from 26 ethnic origins) and 'Simons genome diversity' (covering 279 individuals from 130 ethnic origins) projects. Quantitative and qualitative analyses with a focus on accuracy and scalability show that our approach outperforms state-of-the-art approaches such as VariantSpark and ADMIXTURE. In particular, CEC can cluster targeted population groups in 22 hours with an adjusted Rand index (ARI) of 0.915, a normalized mutual information (NMI) of 0.92, and a clustering accuracy (ACC) of 89 percent. The CAE classifier, in turn, can predict the geographic ethnicity of unknown samples with an F1 score of 0.9004 and a Matthews correlation coefficient (MCC) of 0.8245. Further, to provide interpretations of the predictions, we identify significant biomarkers using gradient boosted trees (GBT) and SHapley Additive exPlanations (SHAP). Overall, our approach is transparent and faster than the baseline methods, and scalable for 5 to 100 percent of the full human genome.

Item Type: Journal Article
Creators:
Creators | Email | ORCID | ORCID Put Code
Karim, Md Rezaul | UNSPECIFIED | UNSPECIFIED | UNSPECIFIED
Cochez, Michael | UNSPECIFIED | orcid.org/0000-0001-5726-4638 | UNSPECIFIED
Zappa, Achille | UNSPECIFIED | orcid.org/0000-0003-4040-9620 | UNSPECIFIED
Sahay, Ratnesh | UNSPECIFIED | UNSPECIFIED | UNSPECIFIED
Rebholz-Schuhmann, Dietrich | UNSPECIFIED | UNSPECIFIED | UNSPECIFIED
Beyan, Oya | UNSPECIFIED | UNSPECIFIED | UNSPECIFIED
Decker, Stefan | UNSPECIFIED | UNSPECIFIED | UNSPECIFIED
URN: urn:nbn:de:hbz:38-679583
DOI: 10.1109/TCBB.2020.2994649
Journal or Publication Title: IEEE-ACM Trans. Comput. Biol. Bioinform.
Volume: 19
Number: 1
Page Range: pp. 369 - 383
Date: 2022
Publisher: IEEE COMPUTER SOC
Place of Publication: LOS ALAMITOS
ISSN: 1557-9964
Language: English
Faculty: Unspecified
Divisions: Unspecified
Subjects: no entry
Uncontrolled Keywords:
Keywords | Language
GENETIC-VARIATION; ADMIXTURE | Multiple languages
Biochemical Research Methods; Computer Science, Interdisciplinary Applications; Mathematics, Interdisciplinary Applications; Statistics & Probability | Multiple languages
URI: http://kups.ub.uni-koeln.de/id/eprint/67958
