dc.rights.license: CC-BY-NC-ND
dc.contributor.advisor: Zomer, Aldert
dc.contributor.author: Gilliquet, Ethel
dc.date.accessioned: 2022-07-07T00:00:37Z
dc.date.available: 2022-07-07T00:00:37Z
dc.date.issued: 2022
dc.identifier.uri: https://studenttheses.uu.nl/handle/20.500.12932/41699
dc.description.abstract: Plasmids are bacterial genetic elements that are replicated and transferred independently from the chromosome. Because of their independent mechanisms of replication and transfer, the study of plasmids is of special interest in epidemiology. The introduction of short-read sequencers has brought an abundance of data to microbial genomics, with great potential to increase knowledge of microbial biology and inform epidemiological decision-making. With this increase in data availability comes the need for computational methods to extract meaningful information from that data. Machine learning tools have been developed to distinguish plasmids from chromosomes in short-read draft genome assemblies. RFPlasmid is one such tool; it uses random forests to classify bacterial contigs. To explore potential improvements of RFPlasmid, a machine learning pipeline was developed in Scikit-Learn. The pipeline addresses the issue of imbalanced datasets, a common problem because more chromosomes are generally sequenced than plasmids. It also probes several methods of feature selection to help separate signal from noise in a wide and sparse dataset and thereby improve classification. Imbalance remains a difficult challenge that requires a multi-faceted approach to improve models for species with few publicly available plasmid sequences. Feature selection did not improve explainability or reduce model complexity. Critical issues came to light, showing that combining fully grown random forests with k-mer features is problematic when modeling plasmids. The insights from this project can be used as a starting point to develop better machine learning algorithms for plasmid detection. However, other computational methods, including graph-, mapping- and clustering-based approaches, may be more promising.
dc.description.sponsorship: Utrecht University
dc.language.iso: EN
dc.subject: Plasmids are bacterial genetic elements replicated independently from the chromosome. To explore potential improvements of plasmid detection tools such as RFPlasmid, a machine learning pipeline was developed. Results show that fully grown random forests and k-mers cause models to overfit. The datasets used are too imbalanced, wide and noisy. The binary distinction between chromosomes and plasmids does not fit the underlying biology. Feature selection did not improve explainability or reduce model complexity.
dc.title: Random Forests for Plasmid Detection - An Exercise in Model Building and Evaluation
dc.type.content: Master Thesis
dc.rights.accessrights: Open Access
dc.subject.keywords: plasmid; random forest; overfitting; imbalanced dataset; kmers; bacterial genomics; machine learning; feature selection; plasmid detection; explainability; model complexity; model evaluation; Sklearn; Python
dc.subject.courseuu: Molecular and Cellular Life Sciences
dc.thesis.id: 5101
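
The abstract refers to k-mer features and to class imbalance between chromosomes and plasmids. As a purely illustrative sketch (not the thesis pipeline or RFPlasmid's implementation), the following shows how a k-mer frequency profile of a contig and inverse-frequency class weights might be computed. The function names `kmer_profile` and `balanced_class_weights`, the default `k=4`, and the example labels are assumptions made for this illustration; the weighting formula is the common "balanced" heuristic, n_samples / (n_classes * class_count).

```python
from collections import Counter
from itertools import product


def kmer_profile(sequence, k=4):
    """Return relative k-mer frequencies over the ACGT alphabet for one contig.

    Hypothetical helper: windows containing other characters (e.g. N) are ignored.
    """
    sequence = sequence.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Count every overlapping window of length k.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # Normalise only over valid ACGT k-mers.
    total = sum(counts[kmer] for kmer in kmers)
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in kmers}


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * c) for label, c in counts.items()}


if __name__ == "__main__":
    profile = kmer_profile("ACGTACGT", k=2)
    print(profile["AC"])
    # Nine chromosome contigs for every plasmid contig (made-up proportions).
    print(balanced_class_weights(["chromosome"] * 9 + ["plasmid"]))
```

Weights of this kind up-weight the rarer class during training; the abstract notes that imbalance remained difficult and required a multi-faceted approach, so this is one ingredient at most.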

