Show simple item record

dc.rights.license: CC-BY-NC-ND
dc.contributor: Matthieu Brinkhuis, Georg Krempl, Joop Snijder
dc.contributor.advisor: Brinkhuis, Matthieu
dc.contributor.author: Grinsven, Micha van
dc.date.accessioned: 2023-03-31T00:00:36Z
dc.date.available: 2023-03-31T00:00:36Z
dc.date.issued: 2023
dc.identifier.uri: https://studenttheses.uu.nl/handle/20.500.12932/43734
dc.description.abstract: Active Learning is relatively underused in real-world machine learning on textual data, even though it has shown better performance than Passive Learning. In this research, Active Learning is applied to two unbalanced datasets, one on the now-defunct energy company Enron and one on the Dutch oil company Shell. The Enron documents are classified on whether they contain information on logistics, whereas the Shell dataset is part of a current investigation by Follow The Money into possible corruption. This research attempts to aid that investigation by identifying documents in the dataset that belong to a storyline. Documents are classified using only the textual data in these datasets. The method is first tested on the Enron dataset and then applied to the Shell dataset. Combining Active Learning and Natural Language Processing on the Shell data achieves an F1-score of 0.87 and an accuracy of 91% using only 5% of labeled data. Active Learning can therefore aid in the investigation of possible corruption. ASReview is used to facilitate this research. The setup presented in this research could be applied to almost any textual data classification problem.
dc.description.sponsorship: Utrecht University
dc.language.iso: EN
dc.subject: By using a combination of Active Learning and Natural Language Processing, this thesis investigates whether documents with specific properties can be detected more efficiently than with currently available methods.
dc.title: The automatization and adressal of processes such as corruption investigations and document analysis using Active Learning and Natural Language Processing.
dc.type.content: Master Thesis
dc.rights.accessrights: Open Access
dc.subject.keywords: Active Learning; Natural Language Processing; Text classification; ASReview
dc.subject.courseuu: Business Informatics
dc.thesis.id: 13710

