        •   Utrecht University Student Theses Repository Home

        Naïve Bayes classifier: normally distributed continuous attributes versus the discretized version of those attributes

        View/Open
        scriptie.pdf (1.361Mb)
        Publication date
        2017
        Author
        Majoor, I.A.G.
        Summary
        This paper examines the differences between training a Naïve Bayes classifier on normally distributed continuous attributes and training it on the discretized version of those attributes. The methods used in the experiment were chosen carefully. The Shapiro-Wilk test was used to check whether an attribute has a normal distribution. Discretization was performed with an unsupervised method, equal frequency, and with a supervised method based on the minimum description length principle. Monte Carlo cross-validation with 25 runs yielded three mean percentages of wrongly classified unseen instances per dataset: one for the continuous attributes themselves and one for each of the two discretization methods. The means were compared with the paired t-test. The conclusion is to keep the continuous attributes when dealing with a larger dataset (2280 to 5000 instances), as fewer unseen instances were wrongly classified and the true difference between the means was significant. For smaller datasets (114 to 250 instances), the true difference between the means was not significant. Only when a smaller dataset had an unbalanced class distribution, with a ratio of 1:2.5 between the numbers of instances per class, did keeping the normally distributed continuous attributes result in better accuracy.
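The comparison described in the summary can be sketched roughly as follows. This is a minimal standard-library illustration and not the thesis code: the synthetic two-class data, the 30% test fraction, the five bins, the Laplace smoothing, and all function names are assumptions made for the example. It covers only the equal frequency discretization; the supervised minimum description length method and the Shapiro-Wilk check are left out.

```python
import math
import random
import statistics
from collections import Counter, defaultdict

def equal_frequency_edges(values, bins):
    """Cut points that place roughly equal numbers of values in each bin."""
    ordered = sorted(values)
    return [ordered[len(ordered) * k // bins] for k in range(1, bins)]

def to_bin(value, edges):
    """Index of the bin that a value falls into."""
    return sum(value >= e for e in edges)

def gaussian_log_pdf(x, mean, var):
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def error_rate(train, test, bins=None):
    """Fit Naive Bayes on train and return the fraction of test misclassified.
    bins=None uses a Gaussian likelihood per attribute; otherwise attributes
    are discretized with equal-frequency bins learned on the training split."""
    n_attr = len(train[0][0])
    by_class = defaultdict(list)
    for x, y in train:
        by_class[y].append(x)
    edges = None
    if bins is not None:
        edges = [equal_frequency_edges([x[j] for x, _ in train], bins)
                 for j in range(n_attr)]
    model = {}
    for y, rows in by_class.items():
        params = []
        for j in range(n_attr):
            col = [r[j] for r in rows]
            if bins is None:
                # Small constant keeps the variance strictly positive.
                params.append((statistics.fmean(col),
                               statistics.pvariance(col) + 1e-9))
            else:
                counts = Counter(to_bin(v, edges[j]) for v in col)
                # Laplace smoothing (an assumption of this sketch).
                params.append([(counts[b] + 1) / (len(col) + bins)
                               for b in range(bins)])
        model[y] = (math.log(len(rows) / len(train)), params)

    def score(c, x):
        prior, params = model[c]
        if bins is None:
            return prior + sum(gaussian_log_pdf(x[j], *params[j])
                               for j in range(n_attr))
        return prior + sum(math.log(params[j][to_bin(x[j], edges[j])])
                           for j in range(n_attr))

    wrong = sum(max(model, key=lambda c: score(c, x)) != y for x, y in test)
    return wrong / len(test)

def monte_carlo_cv(data, runs=25, test_fraction=0.3, bins=5, seed=0):
    """Repeated random train/test splits; both variants see the same splits."""
    rng = random.Random(seed)
    continuous, discretized = [], []
    for _ in range(runs):
        shuffled = data[:]
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * test_fraction)
        test, train = shuffled[:cut], shuffled[cut:]
        continuous.append(error_rate(train, test))
        discretized.append(error_rate(train, test, bins=bins))
    return continuous, discretized

def paired_t(a, b):
    """Paired t statistic for two equally long lists of per-run error rates."""
    diffs = [x - y for x, y in zip(a, b)]
    mean, sd = statistics.fmean(diffs), statistics.stdev(diffs)
    if sd == 0:
        return 0.0 if mean == 0 else math.copysign(math.inf, mean)
    return mean / (sd / math.sqrt(len(diffs)))

def make_data(n, seed=1):
    """Hypothetical data: two classes with normally distributed attributes."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        y = rng.randrange(2)
        data.append(([rng.gauss(2.0 * y, 1.0), rng.gauss(-1.5 * y, 1.0)], y))
    return data

if __name__ == "__main__":
    cont, disc = monte_carlo_cv(make_data(300))
    print("mean error, continuous: ", statistics.fmean(cont))
    print("mean error, discretized:", statistics.fmean(disc))
    print("paired t statistic:     ", paired_t(cont, disc))
```

Both variants are evaluated on identical splits in each run, which is what makes a paired test appropriate. The t statistic would still have to be compared against a t distribution with `runs - 1` degrees of freedom to judge significance.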
        URI
        https://studenttheses.uu.nl/handle/20.500.12932/26756
        Collections
        • Theses