Naïve Bayes classifier: normally distributed continuous attributes versus the discretized version of those attributes
Summary
This paper examines the differences between training a Naïve Bayes classifier on normally distributed continuous attributes and training it on discretized versions of those attributes. The methods used in the experiment were chosen carefully. The Shapiro-Wilk test was used to check whether an attribute follows a normal distribution. Discretization was performed with an unsupervised method, equal-frequency binning, and with a supervised method based on the minimum description length principle. Monte Carlo cross-validation over 25 runs yielded, for each dataset, three mean percentages of wrongly classified unseen instances: one for the continuous attributes themselves and one for each of the two discretization methods. In the results, the differences between these means were tested with the paired t-test. The conclusion is to keep the continuous attributes when dealing with a larger dataset (2280 to 5000 instances), as fewer unseen instances were wrongly classified and the true difference between the means was significant. For the smaller datasets (114 to 250 instances), the true difference between the means was not significant. Only when a smaller dataset had an unbalanced split of the number of instances per class, with a ratio of 1:2.5, did keeping the normally distributed continuous attributes result in better accuracy.
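The pipeline summarized above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes scikit-learn and scipy, uses the iris dataset as a stand-in for the paper's datasets, and picks five quantile bins and a 30% test split, none of which are specified here. Only the equal-frequency variant is shown, since scikit-learn has no built-in MDL-based discretizer.

```python
import numpy as np
from scipy.stats import shapiro, ttest_rel
from sklearn.datasets import load_iris
from sklearn.model_selection import ShuffleSplit
from sklearn.naive_bayes import GaussianNB, CategoricalNB
from sklearn.preprocessing import KBinsDiscretizer

X, y = load_iris(return_X_y=True)  # stand-in dataset

# Shapiro-Wilk normality check per attribute
# (p > 0.05: no evidence against normality at the 5% level)
for j in range(X.shape[1]):
    stat, p_norm = shapiro(X[:, j])

# Monte Carlo cross-validation: 25 random train/test splits
splitter = ShuffleSplit(n_splits=25, test_size=0.3, random_state=0)
err_cont, err_disc = [], []
for train_idx, test_idx in splitter.split(X):
    X_tr, X_te = X[train_idx], X[test_idx]
    y_tr, y_te = y[train_idx], y[test_idx]

    # Continuous attributes: Gaussian Naive Bayes
    gnb = GaussianNB().fit(X_tr, y_tr)
    err_cont.append(1.0 - gnb.score(X_te, y_te))

    # Equal-frequency (quantile) discretization, fitted on the
    # training split only to avoid information leakage
    disc = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile")
    X_tr_d = disc.fit_transform(X_tr)
    X_te_d = disc.transform(X_te)  # clips to the fitted bin range
    cnb = CategoricalNB().fit(X_tr_d, y_tr)
    err_disc.append(1.0 - cnb.score(X_te_d, y_te))

# Paired t-test on the 25 per-split error rates
t_stat, p_value = ttest_rel(err_cont, err_disc)
```

The paired t-test is appropriate here because both classifiers are evaluated on the same 25 splits, so their error rates form matched pairs.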