Scalable Frequent Pattern Mining using Relational Databases
MetadataShow full item record
Frequent pattern mining is a computationally expensive preprocessing step for association rule mining and involves major challenges when data sets are large. Many sophisticated algorithms have been proposed since its introduction, however most of them don't address the real difficulty: scalability. In this thesis we study how to leverage RDBMS integration to improve the scalability of frequent pattern mining. We propose a hybrid framework based on FP-growth, a popular frequent pattern mining algorithm. For FP-tree construction we develop an SQL-based approach and for mining patterns we develop an external application that interfaces with the RDBMS. Subsequently, we adapt this framework to achieve high scalability, by sharding the transaction database and distributing computation over multiple computing instances. Experimental results with MonetDB5 show that this framework is capable of handling large data sets using a scale out strategy. It achieves near-linear speedup when computing instances are added. Finally, we develop an FP-growth based top-k frequent pattern mining method and integrate this method in our framework. To mine the top-k frequent patterns efficiently, we change the FP-growth mining algorithm to mine patterns in support descending order using a level-wise search strategy instead of the depth-first-search strategy of the original algorithm. Experimental results show that this method has surprisingly good performance on a large data set.