Random Forests for Plasmid Detection - An Exercise in Model Building and Evaluation

Gilliquet, Ethel

dc.rights.license	CC-BY-NC-ND
dc.contributor.advisor	Zomer, Aldert
dc.contributor.author	Gilliquet, Ethel
dc.date.accessioned	2022-07-07T00:00:37Z
dc.date.available	2022-07-07T00:00:37Z
dc.date.issued	2022
dc.identifier.uri	https://studenttheses.uu.nl/handle/20.500.12932/41699
dc.description.abstract	Plasmids are bacterial genetic elements that are replicated and transferred independently of the chromosome. Because of their independent mechanisms of replication and transfer, the study of plasmids is of special interest in epidemiology. The introduction of short-read sequencers has brought an abundance of data to microbial genomics, with great potential to increase knowledge of microbial biology and inform epidemiological decision-making. With this increase in data availability comes the need for computational methods to extract meaningful information from that data. Machine learning tools have been developed to distinguish plasmids from chromosomes in short-read draft genome assemblies. RFPlasmid is one such tool; it uses random forests to classify bacterial contigs. To explore potential improvements to RFPlasmid, a machine learning pipeline was developed in scikit-learn. The pipeline addressed dataset imbalance, a common problem because more chromosomes than plasmids are generally sequenced. It also probed several feature selection methods to help separate signal from noise in a wide, sparse dataset and thereby improve classification. Imbalance remains a difficult challenge that requires a multifaceted approach to improve models for species with few publicly available plasmid sequences. Feature selection did not improve explainability or reduce model complexity. Critical issues came to light, showing that combining fully grown random forests with k-mer features is problematic when modeling plasmids. The insights from this project can be used as a starting point to develop better machine learning algorithms for plasmid detection. However, other computational methods, including graph-, mapping-, and clustering-based approaches, may be more promising.
dc.description.sponsorship	Utrecht University
dc.language.iso	EN
dc.subject	Plasmids are bacterial genetic elements replicated independently of the chromosome. To explore potential improvements to plasmid detection tools such as RFPlasmid, a machine learning pipeline was developed. Results show that fully grown random forests and k-mers cause models to overfit. The datasets used are too imbalanced, wide, and noisy. The binary distinction between chromosomes and plasmids does not fit the underlying biology. Feature selection did not improve explainability or reduce model complexity.
dc.title	Random Forests for Plasmid Detection - An Exercise in Model Building and Evaluation
dc.type.content	Master Thesis
dc.rights.accessrights	Open Access
dc.subject.keywords	plasmid; random forest; overfitting; imbalanced dataset; kmers; bacterial genomics; machine learning; feature selection; plasmid detection; explainability; model complexity; model evaluation; Sklearn; Python
dc.subject.courseuu	Molecular and Cellular Life Sciences
dc.thesis.id	5101

Files in this item

Name:: PlasmidDetectionwithRandomFore ...
Size:: 1.917Mb
Format:: PDF
