Theoretical and Practical Aspects of Isolation Forest
Summary
Outlier detection methods are becoming increasingly popular, for example in the financial world, where they are used to detect fraudulent transactions. In this thesis, we explore the Isolation Forest (IF) algorithm, a data-driven anomaly detection method. It differs from other outlier detection methods in that it isolates the outliers directly instead of building a profile of the normal instances. However, little theory is known about this algorithm, and this thesis sets out to develop it. We develop theory based on the number of random splits needed to isolate a data point and validate it numerically against the IF algorithm. Moreover, we develop new outlier detection methods by combining the original IF algorithm with these theoretical formulas. These methods can detect outliers in multi-dimensional datasets because they use a projection to transform multi-dimensional datasets into one-dimensional ones. We test these methods rigorously on multiple datasets, and some of them perform very well. Furthermore, we use the original IF algorithm to test several of its components, for example the impact of pruning the isolation trees and of the number of trees on the performance of the algorithm. We also test different scoring functions in combination with the algorithm. In addition, we compare the IF algorithm with two other outlier detection methods; it shows very good results when detecting the most outlying points of multiple random datasets. Finally, we test the new methods and the original IF method on a real-world financial dataset. On this dataset, the Transformed IF method gives more accurate results than the original IF algorithm.
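
To make the idea of counting random splits concrete, the following is a minimal Python sketch and not the formulas or implementation developed in the thesis. It assumes a one-dimensional dataset and split points drawn uniformly between the current minimum and maximum; the function names `isolation_depth` and `mean_depth` and the toy data are hypothetical and chosen only for illustration.

```python
import random


def isolation_depth(data, target, rng):
    """Count the random splits needed until `target` is alone in its interval."""
    points = list(data)
    depth = 0
    while len(points) > 1:
        lo, hi = min(points), max(points)
        if lo == hi:
            # Only duplicates of the target remain; they cannot be separated.
            break
        split = rng.uniform(lo, hi)  # assumed: uniform split point
        # Keep only the side of the split that contains the target.
        if target < split:
            points = [x for x in points if x < split]
        else:
            points = [x for x in points if x >= split]
        depth += 1
    return depth


def mean_depth(data, target, n_trials=2000, seed=0):
    """Average the isolation depth of `target` over many random trials."""
    rng = random.Random(seed)
    total = sum(isolation_depth(data, target, rng) for _ in range(n_trials))
    return total / n_trials


# Toy data: six clustered points and one point far away from them.
data = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 10.0]
outlier_depth = mean_depth(data, 10.0)
inlier_depth = mean_depth(data, 0.3)
print(f"mean splits to isolate 10.0: {outlier_depth:.2f}")
print(f"mean splits to isolate 0.3:  {inlier_depth:.2f}")
```

In this sketch the distant point needs fewer splits on average than a point inside the cluster, which is the property that lets an isolation-based method flag outliers without modelling the normal instances.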