dc.rights.license | CC-BY-NC-ND | |
dc.contributor.advisor | Krempl, dr. Ing. habil G. | |
dc.contributor.advisor | van Ommen, Dr. M. T. | |
dc.contributor.author | Aas, B.F. | |
dc.date.accessioned | 2020-07-30T18:00:26Z | |
dc.date.available | 2020-07-30T18:00:26Z | |
dc.date.issued | 2020 | |
dc.identifier.uri | https://studenttheses.uu.nl/handle/20.500.12932/36423 | |
dc.description.abstract | Toxicology is a field plagued by a lack of experimental data and labels, as assessing chemical toxicity is a time-consuming and costly process, all while the number of newly released substances keeps growing. This further strengthens the need for robust screening tools and models for classification. Chemical similarity screening traditionally relies on a two-dimensional fingerprint representation of chemical sub-structures, in which a distance measure between fingerprints determines similarity. This approach neglects any potential importance ordering among sub-structures. The novel approach presented in this paper models so-called persistent, bioaccumulative and toxic (PBT) substances based on their physical chemical properties, and examines whether such an approach improves on related fingerprint-based approaches. Further aims are to inspect whether feature importances match a priori expert expectations, and whether the results can be improved by applying active machine learning. Two baseline machine learning models, a Random Forest and a Support Vector Machine, were fit to naive and filtered physical chemical data. The best-performing model achieved a classification accuracy of 94.28% and was also able to pick up on existing legal guideline thresholds for substance evaluation. Furthermore, the hypothesis on expert feature importance was shown to hold, with added importance for features not previously considered. Using a curious machine learning algorithm named Active Learning, it was further shown that similar accuracy could be achieved with 40-50% less data, accompanied by a demo for interactive annotation with a chemical expert that could serve as a cross-referencing check on expert chemical evaluation.
Although further confirming data are needed, the main contribution of this paper is the novel approach of using physico-chemical data, which shows the value of machine learning algorithms as a tool for classifying harmful chemicals. | |
dc.description.sponsorship | Utrecht University | |
dc.format.extent | 2235111 | |
dc.format.mimetype | application/pdf | |
dc.language.iso | en_US | |
dc.title | Chemical Similarity Screening With Machine Learning and Active Learning Using Physical Chemical Properties | |
dc.type.content | Master Thesis | |
dc.rights.accessrights | Open Access | |
dc.subject.keywords | Toxicology, Machine Learning, Chemical Similarity, Random Forest, Support Vector Machine, Active Learning, PBT substances | |
dc.subject.courseuu | Artificial Intelligence | |