Semi-supervised learning for Technology Assisted Review

Öz, Ercan

Publication date

2023

Summary

Finding all documents relevant to a specific information need in a potentially large collection is essential for many researchers, for example those who must sift through thousands of studies to decide which belong in a meta-analysis. It also matters to clinicians, policy-makers, journalists, and even the general public. Technology Assisted Review (TAR) combines machine learning algorithms with human feedback to find all relevant documents, aiming for complete recall at minimal cost. This study investigates methods to enhance the performance of TAR. Labeled data is often scarce because labeling is costly in time and resources, and a lack of labeled data can limit a model's capacity for generalization. To address this limitation, we examined semi-supervised learning (SSL) techniques, which use unlabeled data to improve model performance. This thesis studies various SSL techniques for binary classification and evaluates their contributions to the TAR process. We compared the performance of five semi-supervised classifiers within TAR against their supervised equivalents. The findings show that the semi-supervised Multinomial Naive Bayes classifier, with many-to-one correspondence via sub-topics, improved on its supervised counterpart in multiple cases, particularly in the two datasets with the lowest percentage of relevant documents. Combining AutoTAR with semi-supervised Multinomial Naive Bayes with sub-topics also yielded significant improvements over the supervised AutoTAR model on some datasets. In contrast, label spreading and Support Vector Machines with self-training outperformed their supervised counterparts less frequently.
Although semi-supervised models did not consistently outperform their supervised counterparts, this research demonstrates the potential for improved performance using semi-supervised models. This was most notably observed with the semi-supervised Multinomial Naive Bayes model with many-to-one correspondence.
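
As a rough illustration of how unlabeled documents can inform a Multinomial Naive Bayes classifier, the sketch below uses a plain expectation-maximization loop with one component per class. It is a simplified stand-in under stated assumptions, not the thesis's implementation: it omits the many-to-one sub-topic correspondence and the AutoTAR loop, and every function name and the toy documents are hypothetical.

```python
import math
from collections import Counter


def train_mnb(docs, p_relevant, vocab, alpha=1.0):
    """Fit a two-class Multinomial Naive Bayes model from soft labels.

    docs: list of token lists; p_relevant: P(relevant) per document in [0, 1].
    Returns (log_prior, log_likelihood), each indexed by class 0 or 1.
    """
    class_mass = [0.0, 0.0]
    word_mass = [Counter(), Counter()]
    for tokens, p in zip(docs, p_relevant):
        # Each document contributes fractionally to both classes.
        for c, w in ((0, 1.0 - p), (1, p)):
            class_mass[c] += w
            for t in tokens:
                word_mass[c][t] += w
    total = sum(class_mass)
    log_prior = [math.log((class_mass[c] + alpha) / (total + 2 * alpha))
                 for c in (0, 1)]
    log_lik = []
    for c in (0, 1):
        # Laplace smoothing over the shared vocabulary.
        denom = sum(word_mass[c].values()) + alpha * len(vocab)
        log_lik.append({t: math.log((word_mass[c][t] + alpha) / denom)
                        for t in vocab})
    return log_prior, log_lik


def predict_proba(model, tokens):
    """Return P(relevant | tokens); out-of-vocabulary tokens are ignored."""
    log_prior, log_lik = model
    scores = [log_prior[c] + sum(log_lik[c][t] for t in tokens if t in log_lik[c])
              for c in (0, 1)]
    top = max(scores)  # subtract the max for numerical stability
    exp = [math.exp(s - top) for s in scores]
    return exp[1] / (exp[0] + exp[1])


def semi_supervised_mnb(labeled, unlabeled, n_iter=10):
    """EM: start from the labeled documents, then alternate soft-labeling
    the unlabeled documents (E-step) and refitting on everything (M-step)."""
    vocab = sorted({t for d, _ in labeled for t in d}
                   | {t for d in unlabeled for t in d})
    lab_docs = [d for d, _ in labeled]
    lab_p = [float(y) for _, y in labeled]
    model = train_mnb(lab_docs, lab_p, vocab)
    for _ in range(n_iter):
        soft = [predict_proba(model, d) for d in unlabeled]
        model = train_mnb(lab_docs + unlabeled, lab_p + soft, vocab)
    return model


# Hypothetical toy collection: 1 = relevant, 0 = not relevant.
labeled = [(["cancer", "trial"], 1), (["football", "match"], 0)]
unlabeled = [["cancer", "therapy"], ["trial", "therapy"],
             ["football", "league"], ["match", "league"]]
model = semi_supervised_mnb(labeled, unlabeled)
print(predict_proba(model, ["therapy"]))
```

In this toy collection the word "therapy" never appears in a labeled document, so a model trained on the labeled documents alone has no evidence about it. The unlabeled documents link it to "cancer" and "trial", which is the kind of signal a semi-supervised classifier can exploit when ranking documents for review.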

URI

https://studenttheses.uu.nl/handle/20.500.12932/44266

Collections

Theses