Scalable Frequent Pattern Mining using Relational Databases

Schuiling, N.H.

dc.rights.license	CC-BY-NC-ND
dc.contributor.advisor	Philippi, H.
dc.contributor.author	Schuiling, N.H.
dc.date.accessioned	2014-09-16T17:00:54Z
dc.date.available	2014-09-16T17:00:54Z
dc.date.issued	2014
dc.identifier.uri	https://studenttheses.uu.nl/handle/20.500.12932/18326
dc.description.abstract	Frequent pattern mining is a computationally expensive preprocessing step for association rule mining and involves major challenges when data sets are large. Many sophisticated algorithms have been proposed since its introduction, however most of them don't address the real difficulty: scalability. In this thesis we study how to leverage RDBMS integration to improve the scalability of frequent pattern mining. We propose a hybrid framework based on FP-growth, a popular frequent pattern mining algorithm. For FP-tree construction we develop an SQL-based approach and for mining patterns we develop an external application that interfaces with the RDBMS. Subsequently, we adapt this framework to achieve high scalability, by sharding the transaction database and distributing computation over multiple computing instances. Experimental results with MonetDB5 show that this framework is capable of handling large data sets using a scale out strategy. It achieves near-linear speedup when computing instances are added. Finally, we develop an FP-growth based top-k frequent pattern mining method and integrate this method in our framework. To mine the top-k frequent patterns efficiently, we change the FP-growth mining algorithm to mine patterns in support descending order using a level-wise search strategy instead of the depth-first-search strategy of the original algorithm. Experimental results show that this method has surprisingly good performance on a large data set.
dc.description.sponsorship	Utrecht University
dc.format.extent	1363464
dc.format.mimetype	application/pdf
dc.language.iso	en
dc.title	Scalable Frequent Pattern Mining using Relational Databases
dc.type.content	Master Thesis
dc.rights.accessrights	Open Access
dc.subject.keywords	frequent pattern mining;data mining;fp-growth;databases;query processing;top-k;distributed algorithms
dc.subject.courseuu	Computing Science

Files in this item

Name:: Thesis.pdf
Size:: 1.300Mb
Format:: PDF

View/Open

This item appears in the following Collection(s)

Theses

Show simple item record