Ensemble of Code Tables
Summary
This master's thesis presents non-disjoint clustering algorithms based on the Minimum Description Length (MDL) principle. The algorithms capture the underlying distribution from different perspectives by compressing the data with a series of code tables. A cover algorithm describes how to compress the database using a code table. Every code table is grown iteratively until compression no longer improves. Experiments show that the algorithms are able to identify structure in the data, since the code tables compress the data to some extent. Clustering experiments show that the general structure is captured by all obtained code tables, whereas the groups of patterns that are dissimilar to the general patterns are captured by different code tables. This confirms that the code tables view the data from different perspectives. The classification experiments show that, given the class labels, the code tables are dissimilar enough to capture the different characteristics of the classes. Without the class labels, the algorithms are still able to find the difference between the classes when the support is sufficiently low. It is also possible to identify multi-valued dependencies in the data. This is the case when code tables in a single iteration are anti-chains and later end up in the same code table.
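The interplay between a cover algorithm and MDL-driven growth of a single code table can be illustrated with a minimal sketch. This summary does not specify the thesis's actual cover order, candidate generation, or encoding, so everything below is an assumption made for illustration: patterns are item sets, the cover is greedy (longest item set first), the code table cost is a simplified stand-in, and the names `cover`, `total_length`, and `grow` are hypothetical.

```python
from math import log2


def cover(transaction, code_table):
    """Greedily cover a transaction with non-overlapping item sets, in table order."""
    remaining = set(transaction)
    used = []
    for itemset in code_table:
        if itemset <= remaining:
            used.append(itemset)
            remaining -= itemset
    return used


def total_length(database, code_table):
    """Compressed size in bits: encoded data plus a simplified code table cost."""
    # Assumption: items inside the code table are spelled out with a
    # "standard" code derived from raw item frequencies.
    item_count = {}
    for t in database:
        for i in t:
            item_count[i] = item_count.get(i, 0) + 1
    n_items = sum(item_count.values())
    standard = {i: -log2(c / n_items) for i, c in item_count.items()}

    # Usage of each code table entry when covering the whole database.
    usage = {s: 0 for s in code_table}
    for t in database:
        for s in cover(t, code_table):
            usage[s] += 1
    n_codes = sum(usage.values())

    bits = 0.0
    for s, u in usage.items():
        if u == 0:
            continue  # unused entries get no code and are not charged here
        code_len = -log2(u / n_codes)  # Shannon-optimal code length
        bits += u * code_len  # cost of the encoded data
        bits += code_len + sum(standard[i] for i in s)  # cost of the table entry
    return bits


def grow(database, candidates):
    """Add candidates one at a time, keeping each only if total size shrinks."""
    items = sorted({i for t in database for i in t})
    table = [frozenset([i]) for i in items]  # singletons guarantee a full cover
    best = total_length(database, table)
    for cand in candidates:
        # Assumed table order: longer item sets first, ties broken alphabetically.
        trial = sorted(table + [frozenset(cand)],
                       key=lambda s: (-len(s), sorted(s)))
        bits = total_length(database, trial)
        if bits < best:
            table, best = trial, bits
    return table, best
```

Calling `grow(database, candidates)` with a list of transactions (sets of items) and a list of candidate item sets returns the accepted code table and its total encoded size. A candidate that merely rearranges a few rare transactions tends to be rejected, because the bits it saves in the data do not outweigh the bits needed to store it in the table.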