        Utrecht University Student Theses Repository
        The automatization and adressal of processes such as corruption investigations and document analysis using Active Learning and Natural Language Processing.

        View/Open
        Final_thesis_Mvg.pdf (1.247Mb)
        Publication date
        2023
        Author
        Grinsven, Micha van
        Summary
        Active Learning has been shown to outperform Passive Learning, yet it remains relatively underused in real-world machine learning applications on textual data. In this research, Active Learning is applied to two imbalanced datasets: one on the now-defunct energy company Enron and one on the Dutch oil company Shell. The Enron documents are classified by whether they contain information on logistics, whereas the Shell dataset is part of a current investigation by Follow The Money into possible corruption. This research attempts to aid that investigation by identifying the documents in the dataset that belong to a storyline. Documents are classified using only their textual content. The method is first tested on the Enron dataset and then applied to the Shell dataset. Combining Active Learning with Natural Language Processing on the Shell data yields an F1-score of 0.87 and an accuracy of 91% with only 5% of the data labeled. Active Learning can therefore aid in the investigation of possible corruption. ASReview is used to facilitate this research. The setup presented here could be applied to almost any textual data classification problem.
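As a rough illustration of the pool-based Active Learning loop the summary describes, the Python sketch below repeatedly trains a text classifier on the documents labeled so far and asks for the label of the document it is least sure about. It assumes uncertainty sampling and a small hand-written Naive Bayes classifier. Neither is confirmed to be the configuration used in the thesis or in ASReview, and the function names, toy documents and labels are all invented for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption; real pipelines do more)."""
    return text.lower().split()


def train_nb(docs, labels):
    """Collect the counts needed for a multinomial Naive Bayes model."""
    word_counts = {0: Counter(), 1: Counter()}
    class_counts = Counter(labels)
    for doc, y in zip(docs, labels):
        word_counts[y].update(tokenize(doc))
    vocab = set(word_counts[0]) | set(word_counts[1])
    return word_counts, class_counts, vocab


def prob_relevant(model, doc):
    """Return P(label = 1 | doc) using add-one smoothing."""
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    log_scores = {}
    for y in (0, 1):
        n_words = sum(word_counts[y].values())
        score = math.log((class_counts[y] + 1) / (total + 2))
        for tok in tokenize(doc):
            if tok in vocab:  # ignore words never seen in labeled documents
                score += math.log((word_counts[y][tok] + 1) / (n_words + len(vocab)))
        log_scores[y] = score
    # Normalize the two log scores into a probability.
    top = max(log_scores.values())
    e0 = math.exp(log_scores[0] - top)
    e1 = math.exp(log_scores[1] - top)
    return e1 / (e0 + e1)


def active_learning(pool, oracle, seed_idx, budget):
    """Label up to `budget` extra documents, querying the most uncertain one each round."""
    labeled = {i: oracle(i) for i in seed_idx}
    for _ in range(budget):
        unlabeled = [i for i in range(len(pool)) if i not in labeled]
        if not unlabeled:
            break
        idx = sorted(labeled)
        model = train_nb([pool[i] for i in idx], [labeled[i] for i in idx])
        # Uncertainty sampling: P(relevant) closest to 0.5 is the least certain.
        query = min(unlabeled, key=lambda i: abs(prob_relevant(model, pool[i]) - 0.5))
        labeled[query] = oracle(query)  # the oracle stands in for a human annotator
    idx = sorted(labeled)
    model = train_nb([pool[i] for i in idx], [labeled[i] for i in idx])
    return model, labeled


# Hypothetical toy pool: 1 = mentions logistics, 0 = does not.
pool = [
    "shipment schedule for the pipeline delivery",
    "lunch on friday anyone",
    "truck delivery delayed at the terminal",
    "happy birthday to you",
    "pipeline capacity and shipment dates",
    "friday party invitation",
]
truth = [1, 0, 1, 0, 1, 0]

model, labeled = active_learning(pool, lambda i: truth[i], seed_idx=[0, 1], budget=2)
score = prob_relevant(model, "shipment schedule delivery")
```

Here `labeled` holds the two seed documents plus the two the loop chose to query, and `score` is the model's estimated probability that a new document is about logistics. In a real setting the oracle is a person, and the loop stops when the labeling budget (for example, a fixed share of the documents) is used up.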
        URI
        https://studenttheses.uu.nl/handle/20.500.12932/43734
        Collections
        • Theses