        Utrecht University Student Theses Repository
        A Natural Language Processing Model for Bacterial Genome Assembly Using Gene Annotations

        View/Open
        Final_report_wilco_hermens (1).pdf (3.813Mb)
        Publication date
        2025
        Author
        Hermens, Wilco
        Summary
        In this project, we explored a new way of assembling bacterial genomes: applying a natural language processing (NLP) approach, normally used to understand text, to help piece together fragments of DNA. Bacterial genome assembly is complicated by repetitive DNA sequences, which make it difficult for current methods to order the fragments correctly. Our approach aimed to address this by training a model to "understand" the connections between DNA fragments, much as word-embedding models in language processing learn relationships between words. We started with gene sequences from known bacterial genomes, annotated them with specific codes using a tool called Bakta, and trained a Word2Vec model on these codes. This allowed the model to learn how genes are usually connected in a bacterial genome. To test this, we assembled short-read DNA data into fragments (called contigs) and annotated these with Bakta as well. By examining the codes at the beginning and end of each contig, the model could suggest possible links between contigs based on the similarity of their code patterns. We trained one model on multiple bacterial species together and also created separate models for each species. Although the species-specific models performed better, the combined model still worked surprisingly well, which could simplify the process when working with mixed bacterial samples in real-world applications. This approach shows promise for handling repetitive DNA regions and potentially improving the accuracy of bacterial genome assembly.
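The linking idea in the summary can be illustrated with a small sketch. It is not the thesis implementation: to stay dependency-free it replaces Word2Vec with plain co-occurrence count vectors, and the gene codes, contig names, window size, and function names (`cooccurrence_vectors`, `end_vector`, `rank_links`) are all hypothetical choices made for illustration. Contig orientation is ignored for simplicity.

```python
from collections import Counter, defaultdict
from math import sqrt


def cooccurrence_vectors(genomes, window=2):
    """Build one sparse context-count vector per annotation code.

    Stand-in for Word2Vec training: codes that occur in similar
    neighbourhoods end up with similar vectors.
    """
    vectors = defaultdict(Counter)
    for codes in genomes:
        for i, code in enumerate(codes):
            lo, hi = max(0, i - window), min(len(codes), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[code][codes[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a if key in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def end_vector(codes, vectors):
    """Sum the vectors of the codes found at one end of a contig."""
    total = Counter()
    for code in codes:
        total.update(vectors.get(code, {}))
    return total


def rank_links(contig, candidates, vectors, k=2):
    """Rank candidate contigs by how well their start matches this contig's end."""
    tail = end_vector(contig[-k:], vectors)
    scored = [
        (name, cosine(tail, end_vector(codes[:k], vectors)))
        for name, codes in candidates.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical annotated reference genomes (ordered gene codes).
reference_genomes = [
    ["dnaA", "dnaN", "recF", "gyrB", "gyrA", "rpsL", "rpsG", "fusA", "tufA"],
    ["dnaA", "dnaN", "recF", "gyrB", "rpsL", "rpsG", "fusA", "tufA"],
]
vectors = cooccurrence_vectors(reference_genomes)

# Hypothetical annotated contigs from a short-read assembly.
contig_1 = ["dnaA", "dnaN"]
candidates = {
    "contig_2": ["recF", "gyrB", "gyrA"],
    "contig_3": ["fusA", "tufA"],
}
ranked = rank_links(contig_1, candidates, vectors)
```

In this toy data, `contig_2` starts with genes that follow `dnaN` in both reference genomes, so it ranks above `contig_3`, whose genes come from a different neighbourhood. A trained embedding model would take the place of `cooccurrence_vectors` while the ranking step stays the same in spirit.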
        URI
        https://studenttheses.uu.nl/handle/20.500.12932/50743
        Collections
        • Theses