Utrecht University Student Theses Repository

        Theoretical and Practical Aspects of Isolation Forest

        Thesis Mark Sterkenburg (5648831).pdf (4.332Mb)
        Publication date
        2022
        Author
        Sterkenburg, Mark
        Summary
        Outlier detection methods are becoming increasingly popular, for example in the financial world, where they are used to detect fraudulent transactions. In this thesis, we explore the Isolation Forest (IF) algorithm, a data-driven anomaly detection method. It distinguishes itself from other outlier detection methods by isolating the outliers directly instead of building a profile of the normal instances. However, little theory is known about this algorithm, so we explore its theoretical side here. We develop theory based on the number of random splits needed to isolate a datapoint and validate it numerically against the IF algorithm. Moreover, we develop new outlier detection methods by combining the original IF algorithm with these theoretical formulas. These methods can detect outliers in multi-dimensional datasets because they use a projection to transform multi-dimensional datasets into one-dimensional ones. We test these methods rigorously on multiple datasets, and some of the new methods perform very well. Furthermore, we use the original IF algorithm to test several of its components, for example the impact of pruning the isolation trees and of the number of trees on the performance of the algorithm. We also test different scoring functions in combination with this algorithm. The IF algorithm is also compared with two other outlier detection methods and shows very good results when detecting the most outlying points of multiple random datasets. Finally, we test the new methods and the original IF method on a real-world financial dataset. On this dataset, the Transformed IF method gives more accurate results than the original IF algorithm.
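The central idea in the summary, that outliers need fewer random splits to be isolated than normal points, can be illustrated with a small one-dimensional simulation. The sketch below is not taken from the thesis and does not reproduce its formulas or its Transformed IF method. The function names `splits_to_isolate` and `mean_splits`, the toy dataset, and the number of trials are all assumptions made for illustration.

```python
import random


def splits_to_isolate(values, target, rng):
    """Count the random splits needed until `target` is alone in its interval."""
    points = list(values)
    splits = 0
    while len(points) > 1:
        lo, hi = min(points), max(points)
        if lo == hi:
            # Only duplicates of the target remain; they cannot be separated.
            break
        # Draw a split position uniformly between the current min and max.
        cut = rng.uniform(lo, hi)
        # Keep only the side of the split that contains the target.
        if target < cut:
            points = [p for p in points if p < cut]
        else:
            points = [p for p in points if p >= cut]
        splits += 1
    return splits


def mean_splits(values, target, trials=2000, seed=0):
    """Average the split count over many independent random runs."""
    rng = random.Random(seed)
    total = sum(splits_to_isolate(values, target, rng) for _ in range(trials))
    return total / trials


# Toy data (an assumption): twenty evenly spaced points plus one far-away value.
data = [float(i) for i in range(20)] + [100.0]

outlier_depth = mean_splits(data, 100.0)  # the isolated, far-away point
inlier_depth = mean_splits(data, 10.0)    # a point in the middle of the cluster

print(f"mean splits for outlier: {outlier_depth:.2f}")
print(f"mean splits for inlier:  {inlier_depth:.2f}")
```

In this toy setting the far-away point is usually cut off by the first split, while the point inside the cluster needs at least two splits (one on each side), so its average is higher. An isolation tree applies the same kind of random split recursively, and the forest averages the resulting depths over many trees.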
        URI
        https://studenttheses.uu.nl/handle/20.500.12932/42666
        Collections
        • Theses