        Utrecht University Student Theses Repository

        Scaling and normalization of Word Embeddings: Evaluating the impact of ASReview performance

        Applied_Data_Science_master_s_thesis-10-1.pdf (7.862Mb)
        Publication date
        2024
        Author
        Willems, Sjard
        Summary
        Active learning makes systematic reviews more efficient by identifying relevant papers and maximizing the work saved over random sampling (WSS). This study investigates how preprocessing techniques affect the performance of active learning models. Specifically, it evaluates TF-IDF, SBERT, and Doc2Vec embeddings combined with different normalization and scaling methods, using Naive Bayes and logistic regression classifiers. The findings indicate that TF-IDF embeddings performed best, particularly with L2 normalization and absolute minimum scaling (adding the absolute minimum value) paired with Naive Bayes, achieving high recall and a low average time to find relevant documents. Among the SBERT combinations, the highest WSS was achieved by combining z-score or Pareto normalization and absolute minimum scaling with logistic regression; this configuration showed a 3% lower WSS and required additional computational resources. Doc2Vec, although less effective than SBERT, performed well with z-score or Pareto normalization and CDF scaling, without needing a GPU. While TF-IDF remains a robust benchmark, SBERT and Doc2Vec offer promising alternatives for improving systematic reviews. Further research should explore more combinations of feature extractors, classifiers, and normalization and scaling techniques, as well as fine-tuning.
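The normalization and scaling steps named in the summary can be illustrated with a minimal NumPy sketch. This is not the thesis code: the function names are hypothetical, the definitions are the textbook ones (Pareto scaling divides centered values by the square root of the standard deviation), and applying the absolute-minimum shift over the whole matrix rather than per feature is an assumption. The shift matters because dense embeddings such as SBERT or Doc2Vec can contain negative values, which classifiers like multinomial Naive Bayes do not accept. CDF scaling is left out because the summary does not define it.

```python
import numpy as np


def l2_normalize(X):
    """Scale each row (document vector) to unit Euclidean length."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.where(norms == 0, 1.0, norms)  # leave all-zero rows unchanged


def z_score(X):
    """Center each column and divide by its standard deviation."""
    std = X.std(axis=0)
    return (X - X.mean(axis=0)) / np.where(std == 0, 1.0, std)


def pareto(X):
    """Center each column and divide by the square root of its standard deviation."""
    std = X.std(axis=0)
    return (X - X.mean(axis=0)) / np.sqrt(np.where(std == 0, 1.0, std))


def add_absolute_minimum(X):
    """Add |min(X)| to every entry so the result is non-negative.

    Assumption: the minimum is taken over the whole matrix, not per column.
    """
    return X + abs(X.min())


# Toy stand-in for an embedding matrix: 2 documents, 2 features.
X = np.array([[1.0, -2.0],
              [3.0, 0.0]])

# One of the combinations mentioned in the summary: z-score, then the shift.
features = add_absolute_minimum(z_score(X))
```

After the shift every entry of `features` is non-negative, so the matrix could be passed to a classifier that rejects negative inputs; whether the thesis applies the steps in exactly this order is not stated in the summary.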
        URI
        https://studenttheses.uu.nl/handle/20.500.12932/46956
        Collections
        • Theses