        •   Utrecht University Student Theses Repository Home

        Random Forests for Plasmid Detection - An Exercise in Model Building and Evaluation

        View/Open
        PlasmidDetectionwithRandomForests_Final_20211221.pdf (1.917Mb)
        Publication date
        2022
        Author
        Gilliquet, Ethel
        Summary
        Plasmids are bacterial genetic elements that are replicated and transferred independently of the chromosome. Because of these independent mechanisms of replication and transfer, the study of plasmids is of special interest in epidemiology. The introduction of short-read sequencers has brought an abundance of data to microbial genomics, with great potential to increase knowledge of microbial biology and inform epidemiological decision-making. With this increase in data availability comes the need for computational methods to extract meaningful information from that data. Machine learning tools have been developed to distinguish plasmids from chromosomes in short-read draft genome assemblies. RFPlasmid is one such tool; it uses random forests to classify bacterial contigs. To explore potential improvements to RFPlasmid, a machine learning pipeline was developed in scikit-learn. The pipeline addresses the issue of imbalanced datasets, a common problem because more chromosomes are generally sequenced than plasmids. It also probes several methods of feature selection to help separate signal from noise in a wide and sparse dataset and thereby improve classification. Imbalance remains a difficult challenge that requires a multi-faceted approach to improve models for species for which few plasmid sequences are publicly available. Feature selection did not improve explainability or reduce model complexity. Critical issues came to light, showing that the combination of fully grown random forests and k-mers is problematic when modeling plasmids. The insights from this project can be used as a starting point to develop better machine learning algorithms for plasmid detection. However, other computational methods, including graph-, mapping-, and clustering-based approaches, may be more promising.
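        As a rough illustration of two ideas the summary mentions, k-mer features and class imbalance, the following minimal Python sketch may help. It is not code from the thesis or from RFPlasmid. The function names, the choice of k, and the example labels are assumptions made for illustration only. The weighting formula is the common "balanced" heuristic, n_samples / (n_classes * class_count), and is not necessarily what the thesis used.

```python
from __future__ import annotations

from collections import Counter
from itertools import product


def kmer_frequencies(contig: str, k: int = 3) -> dict[str, float]:
    """Relative frequency of every DNA k-mer in one contig (hypothetical helper)."""
    contig = contig.upper()
    # Enumerate all 4**k possible k-mers so every contig yields the same columns.
    counts = {"".join(p): 0 for p in product("ACGT", repeat=k)}
    total = 0
    for i in range(len(contig) - k + 1):
        kmer = contig[i:i + k]
        if kmer in counts:  # skip windows containing N or other ambiguity codes
            counts[kmer] += 1
            total += 1
    return {kmer: (n / total if total else 0.0) for kmer, n in counts.items()}


def balanced_class_weights(labels: list[str]) -> dict[str, float]:
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    return {c: n_samples / (len(counts) * n) for c, n in counts.items()}


if __name__ == "__main__":
    features = kmer_frequencies("ACGTACGT", k=2)
    print(len(features), features["AC"])

    # Assumed toy labels: nine chromosomal contigs for every plasmid contig.
    print(balanced_class_weights(["chromosome"] * 9 + ["plasmid"]))
```

        The number of feature columns grows as 4**k (1024 for k = 5), which gives a sense of how such a dataset becomes wide and sparse. The weights show how a rare plasmid class can be up-weighted relative to the abundant chromosome class.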
        URI
        https://studenttheses.uu.nl/handle/20.500.12932/41699
        Collections
        • Theses