dc.rights.license	CC-BY-NC-ND
dc.contributor.advisor	Feelders, Ad
dc.contributor.author	Öz, Ercan
dc.date.accessioned	2023-07-22T00:01:54Z
dc.date.available	2023-07-22T00:01:54Z
dc.date.issued	2023
dc.identifier.uri	https://studenttheses.uu.nl/handle/20.500.12932/44266
dc.description.abstract	Finding all documents relevant to a specific information need in a potentially large collection is essential not only for researchers who must sift through thousands of studies to determine which are relevant for a meta-analysis, but also for clinicians, policy-makers, journalists, and the general public. Technology Assisted Review (TAR) combines machine learning algorithms with human feedback to find all relevant documents, aiming for complete recall at minimal cost. This study investigates methods to enhance the performance of TAR. Labeled data is often scarce because labeling is costly in time and resources, and a lack of labeled data can limit a model's capacity to generalize. Semi-supervised learning (SSL) techniques, which use unlabeled data to improve model performance, were examined to address this limitation. This thesis studies various SSL techniques for binary classification and evaluates their contributions to the TAR process. We compared the performance of five semi-supervised classifiers within TAR against their supervised equivalents. The findings show that the semi-supervised Multinomial Naive Bayes classifier with many-to-one correspondence via sub-topics repeatedly improved performance over its supervised counterpart, particularly on the two datasets with the lowest percentage of relevant documents. Combining AutoTAR with semi-supervised Multinomial Naive Bayes and sub-topics also yielded significant improvements on some datasets compared to the supervised AutoTAR model. In contrast, label spreading and Support Vector Machines with self-training less frequently outperformed their supervised counterparts. Although the semi-supervised models did not consistently outperform their supervised counterparts, this research demonstrates their potential for improved performance, most notably with the semi-supervised Multinomial Naive Bayes model with many-to-one correspondence.
dc.description.sponsorship	Utrecht University
dc.language.iso	EN
dc.subject	Finding all documents relevant to a specific information need in a potentially large collection of documents is essential for many researchers. Technology Assisted Review (TAR) incorporates machine learning algorithms and human feedback to find all relevant documents to achieve complete recall at a minimal cost. This thesis studies various semi-supervised learning (SSL) techniques for binary classification and evaluates their contributions to the TAR process.
dc.title	Semi-supervised learning for Technology Assisted Review
dc.type.content	Master Thesis
dc.rights.accessrights	Open Access
dc.subject.keywords	Technology Assisted Review; Semi-supervised learning; Multinomial Naive Bayes; Label spreading; Support Vector Machine; Work Saved over Sampling; Self-training; Active learning
dc.subject.courseuu	Artificial Intelligence
dc.thesis.id	19858
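
The abstract names several of the SSL techniques the thesis compares (self-training, label spreading, semi-supervised Multinomial Naive Bayes). The snippet below is a minimal illustrative sketch of one such setup in scikit-learn: self-training a Multinomial Naive Bayes relevance classifier on a partially labeled document collection, with unlabeled documents marked by -1. The documents, labels, and threshold are invented for illustration; this is not the thesis's code and does not implement the many-to-one sub-topic scheme or AutoTAR.

# Minimal sketch (hypothetical data, not the thesis's implementation):
# self-training a Multinomial Naive Bayes classifier on partially labeled text.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.semi_supervised import SelfTrainingClassifier

docs = [
    "randomised trial of drug A for hypertension",   # labeled relevant
    "adverse events of drug A in a cohort study",    # labeled relevant
    "editorial on hospital management policy",       # labeled not relevant
    "interview about science journalism careers",    # labeled not relevant
    "drug A dosing and blood pressure outcomes",     # unlabeled
    "opinion piece on research funding",             # unlabeled
]
# scikit-learn's semi-supervised estimators mark unlabeled samples with -1.
y = np.array([1, 1, 0, 0, -1, -1])

X = CountVectorizer().fit_transform(docs)  # bag-of-words term counts

# Self-training: repeatedly refit the base classifier, adding unlabeled
# documents whose predicted class probability exceeds the threshold.
model = SelfTrainingClassifier(MultinomialNB(), threshold=0.6)
model.fit(X, y)

print(model.predict(X[4:]))  # predicted relevance of the unlabeled documents

In a TAR setting the labeled pool would grow as the reviewer screens documents suggested by the model, so a loop of this kind would be refit after each round of human feedback.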

