On the influence of dataset characteristics on classifier performance
Summary
The field of Machine Learning has been rapidly gaining attention from both academic and commercial parties. To promote rapid deployment of analytical solutions, several tools have been developed to aid the novice user. Concurrently, fields like meta-learning have been making great progress in developing models of algorithm performance on different datasets. One of the central issues in Machine Learning, for both novices and experts, is which learning algorithm to use on a given dataset. Although many solutions have been proposed, a definitive one has yet to be found. We argue that a possible solution lies in a deeper understanding of the data we are dealing with. By characterizing datasets in terms of meta-features, such as the size of the dataset, we can compare and discuss different datasets and relate them to algorithm performance. A better empirical and analytical understanding of the data may also improve algorithm development, yield significant time savings and present new insights. Focussing on classification algorithms, we present a number of ways in which meta-features can contribute to machine learning research. We discuss several challenges and guidelines that have been proposed in the relevant literature, and lastly we present what little is known about several meta-features and their relation to a classifier's performance.