A Natural Language Processing Model for Bacterial Genome Assembly Using Gene Annotations
Summary
In this project, we explored a new way of assembling bacterial genomes by using a natural language
processing (NLP) approach, typically used for understanding text, to help piece together fragments of DNA.
Bacterial genome assembly is challenging because repetitive DNA sequences make it difficult for current
methods to order the fragments correctly.
Our approach aimed to address these challenges by training a model to "understand" connections between
DNA fragments, similar to how word-embedding models in language processing understand relationships
between words.
We started by analyzing gene sequences from known bacterial genomes, annotating them with specific
codes using a tool called Bakta, and training a Word2Vec model on these codes. This allowed our model
to learn how these genes usually connect in a bacterial genome. To test this, we first assembled DNA
fragments (called contigs) from short-read DNA data and annotated them with Bakta in the same way. By
comparing the codes at the beginning and end of each contig, the model could suggest possible links between
contigs based on the similarity of their code patterns.
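
The contig-linking step described above can be illustrated with a minimal sketch. It is not the project's actual implementation: the scoring rule (cosine similarity between the averaged embeddings of the last few codes of one contig and the first few codes of another), the window size, the code names, and the two-dimensional vectors are all assumptions chosen for illustration. In practice the vectors would come from the trained Word2Vec model rather than a hand-written table, and strand orientation is ignored here.

```python
import math

# Hypothetical embedding table: annotation code -> vector.
# These toy values stand in for vectors learned by a Word2Vec model.
EMBEDDINGS = {
    "codeA": [1.0, 0.0],
    "codeB": [0.9, 0.1],
    "codeC": [0.0, 1.0],
    "codeD": [0.1, 0.9],
}

# Hypothetical contigs, each an ordered list of annotation codes.
CONTIGS = {
    "contig1": ["codeC", "codeA", "codeB"],
    "contig2": ["codeB", "codeA", "codeD", "codeC"],
    "contig3": ["codeC", "codeD", "codeA"],
}


def mean_vector(codes, embeddings):
    """Average the vectors of the codes that have an embedding."""
    vectors = [embeddings[c] for c in codes if c in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_successors(contigs, embeddings, window=2):
    """For each contig, pick the other contig whose start best matches its end.

    Returns a dict: contig name -> (best successor name, similarity score).
    """
    best = {}
    for a, codes_a in contigs.items():
        tail = mean_vector(codes_a[-window:], embeddings)
        if tail is None:
            continue
        for b, codes_b in contigs.items():
            if a == b:
                continue
            head = mean_vector(codes_b[:window], embeddings)
            if head is None:
                continue
            score = cosine(tail, head)
            if a not in best or score > best[a][1]:
                best[a] = (b, score)
    return best


for name, (successor, score) in best_successors(CONTIGS, EMBEDDINGS).items():
    print(f"{name} -> {successor} (similarity {score:.3f})")
```

Averaging over a small window rather than comparing single codes is one possible design choice; it makes the score less sensitive to a single unusual annotation at a contig boundary, at the cost of blurring the exact junction.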
We trained the model on multiple bacterial species together and also created models for each species
separately. While species-specific models performed better, the combined model for all species still worked
surprisingly well, which could simplify the process when working with mixed bacterial samples in real-world
applications. This approach shows promise in handling repetitive DNA regions and potentially improving
genome assembly accuracy in bacterial studies.
