Awad, Mohamed (2022). Chromosome-By-Chromosome Assembly: A Scalable Method For De Novo Assembly. PhD thesis, Universität zu Köln.

Abstract

High-quality genome assembly has wide applications in genetics and medical studies. However, reconstructing complete genome sequences from sequencing data is a complex computational problem. Numerous tools have been developed to assemble short and long reads into longer representative sequences, but the resulting assemblies are often fragmented because of repetitive sequence and heterozygosity, even in studies that use the latest long-read technologies. A computational framework that can produce gap-free chromosome-scale assemblies is therefore a pressing need in modern biology. In this dissertation, we introduced chromosome-by-chromosome assembly, a scalable computational framework for de novo genome assembly, and demonstrated its efficiency through its implementation in the assembler GALA. GALA separates raw sequencing data chromosome by chromosome using a multilayer graph algorithm that identifies and resolves misassemblies within preliminary assemblies, and then clusters the contigs from the preliminary assemblies and the raw reads into linkage groups. For complex genomes, extra information such as Hi-C, genetic maps and even motif analyses can be used to merge multiple linkage groups into larger linkage groups, each representing a single chromosome. Assembling each linkage group with existing assembly tools leads to a gap-free complete genome assembly. Statistics based on real data demonstrated that chromosome-by-chromosome assembly can significantly reduce the complexity of the assembly graph for most existing assembly tools and achieve highly accurate gap-free chromosome-scale assembly.

Firstly, we tested GALA on heterogeneous third-generation sequencing datasets with different depths to demonstrate its advantage. Our method outperformed current de novo assembly pipelines at low sequencing depth. In addition, GALA successfully produced telomere-to-telomere (T2T) assemblies for C. elegans and seven human chromosomes. Furthermore, GALA assembled complete gap-free chromosome-arm pseudomolecules for A. thaliana and four human chromosomes. Interestingly, our method overcomes technology barriers, facilitating straightforward assembly of genomes from heterogeneous datasets and algorithms and generating high-quality de novo assemblies.

Secondly, we exploited GALA’s ability to handle heterogeneous data to achieve gap-free chromosome-scale assemblies of Cardamine hirsuta, C. oligosperma and C. resedifolia, close relatives of the model plant Arabidopsis thaliana. GALA obtained gap-free T2T de novo assemblies of two Cardamine hirsuta strains, Azores and the Oxford reference strain, and of the C. oligosperma genome. GALA also assembled five C. resedifolia chromosomes T2T and three further chromosomes with a single centromeric gap each. Additionally, we conducted a comparative genomic study of the assembled genomes to examine collinearity and prominent structural variants among them.

Finally, we applied chromosome-by-chromosome assembly to metagenomes, a more challenging scenario in which multiple haplotypes are sequenced at different depths and mixed together. We developed MRDA to facilitate metagenomic data separation for chromosome-by-chromosome de novo assembly. MRDA is implemented as a triple-layer graph and follows a reference-guided data separation strategy to classify the preliminary contigs and apply chromosome-by-chromosome assembly, achieving multiple-haplotype assembly of circular microbial molecules. Our method outperformed current de novo assembly pipelines in contiguity and in the number of recovered circular chromosomes.

Overall, we introduced a computational framework for chromosome-by-chromosome assembly. Based on this framework, we implemented two multilayer graph algorithms for gap-free chromosome-scale assembly of heterogeneous sequencing data. Our algorithms show very promising performance in state-of-the-art de novo assembly.
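The data-separation idea in the abstract can be illustrated with a minimal sketch. This is not GALA's or MRDA's implementation: it assumes contigs from several preliminary assemblies are already paired by alignment, treats each aligned pair as a graph edge, and takes connected components as linkage groups. It omits misassembly detection entirely, and the function names, contig labels and read-hit table are all hypothetical.

```python
from collections import defaultdict


def linkage_groups(contigs, alignments):
    """Group contigs into linkage groups (connected components).

    contigs: iterable of contig names, e.g. "asmA:c1" (hypothetical labels).
    alignments: pairs of contig names that align to each other,
                typically from different preliminary assemblies.
    """
    parent = {c: c for c in contigs}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in alignments:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    groups = defaultdict(set)
    for c in parent:
        groups[find(c)].add(c)
    # Sort for a deterministic result.
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])


def assign_reads(read_hits, groups):
    """Bin reads by the linkage group of the contig each read maps to."""
    contig_to_group = {c: i for i, g in enumerate(groups) for c in g}
    bins = defaultdict(list)
    for read, contig in read_hits.items():
        if contig in contig_to_group:
            bins[contig_to_group[contig]].append(read)
    return dict(bins)


# Toy input: two preliminary assemblies (asmA, asmB) of the same genome.
contigs = ["asmA:c1", "asmA:c2", "asmB:c1", "asmB:c2", "asmB:c3"]
alignments = [
    ("asmA:c1", "asmB:c1"),
    ("asmA:c1", "asmB:c2"),
    ("asmA:c2", "asmB:c3"),
]
groups = linkage_groups(contigs, alignments)
read_bins = assign_reads(
    {"r1": "asmA:c1", "r2": "asmB:c3", "r3": "asmB:c2", "r4": "unplaced"},
    groups,
)
```

In this sketch each bin of reads would then be passed to an existing assembler on its own, which is the step the abstract credits with simplifying the assembly graph.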

Item Type: Thesis (PhD thesis)
Creators: Awad, Mohamed (email: mah.biotech2010@gmail.com; ORCID: unspecified)
URN: urn:nbn:de:hbz:38-619744
Date: 13 June 2022
Language: English
Faculty: Faculty of Mathematics and Natural Sciences
Divisions: Außeruniversitäre Forschungseinrichtungen > MPI for Plant Breeding Research
Subjects: Natural sciences and mathematics
Uncontrolled Keywords (English): Genome Assembly; Comparative genomics; Metagenome assembly; Chromosome level assembly
Date of oral exam: 2 March 2022
Referee: Tsiantis, Miltos (Prof. Dr.); Tresch, Achim (Prof. Dr.)
Refereed: Yes
URI: http://kups.ub.uni-koeln.de/id/eprint/61974
