Show simple item record

dc.rights.licenseCC-BY-NC-ND
dc.contributor.advisorOmmen, Thijs van
dc.contributor.authorPol, Tijmen van den
dc.date.accessioned2021-12-10T00:00:16Z
dc.date.available2021-12-10T00:00:16Z
dc.date.issued2021
dc.identifier.urihttps://studenttheses.uu.nl/handle/20.500.12932/284
dc.description.abstractSonarQube is a popular free tool for automatically detecting bugs and code smells in code. It is interesting for developers to know which of the rules SonarQube uses are important to fix and which are less important. We conduct a study with a dataset that links SonarQube rule violations to faults; this dataset was built using a more recent version of SonarQube than in earlier work. Machine learning (ML) is used to obtain feature importances that show which rules are really important. We examine these importances while taking into account that the dataset is imbalanced, using oversampling and ML methods that deal well with imbalanced data. To determine the best rules by importance, we focus on rules that predict faults well rather than on correctly classifying the largest number of entries, since the dataset is imbalanced. We do so by looking at G-mean and F-beta importances. Furthermore, we investigate the difference between permutation and drop-column importance, since the paper that inspired this work (Lenarduzzi, Lomio, Huttunen, et al., 2020) used drop-column importance, whereas we use permutation importance. We find that GradientBoost scores best on AUC, G-mean and F1-score, while RandomForest scores best on F5-score. We observe that SonarQube 7 rules do a good job of predicting faults. However, we also find that an ML method's variable importance can differ from what we might consider important for a SonarQube rule. When we look at the importance of rule types, we find that bugs are relevant for predicting faults, which differs from the paper this research was based on (Lenarduzzi, Lomio, Huttunen, et al., 2020).
dc.description.sponsorshipUtrecht University
dc.language.isoEN
dc.subjectWe conduct a study with a dataset that links SonarQube rule violations to faults. This dataset was built using a more recent version of SonarQube than in earlier work. Machine learning (ML) is used to obtain feature importances that show which rules are really important. We examine these importances while taking into account that the dataset is imbalanced.
dc.titleSonarQube rule violations that actually lead to bugs
dc.type.contentMaster Thesis
dc.rights.accessrightsOpen Access
dc.subject.keywordsSonarQube; Software Quality; Machine Learning; imbalanced dataset; Technical Debt Dataset
dc.subject.courseuuComputing Science
dc.thesis.id1221
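The abstract contrasts permutation importance (shuffle a feature, measure the score drop) with the drop-column importance used in Lenarduzzi, Lomio, Huttunen, et al., 2020 (retrain without the feature, measure the score drop). A minimal sketch of the two techniques on synthetic data; the dataset, model choice and settings here are illustrative assumptions, not the thesis's actual setup:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Illustrative stand-in for a rule-violations-vs-faults dataset.
X, y = make_classification(n_samples=500, n_features=5, n_informative=3,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

# Permutation importance: shuffle one feature at a time on held-out data
# and record how much the score drops. The model is trained only once.
perm = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)

# Drop-column importance: retrain the model without each feature and
# record how much the score drops relative to the full-feature baseline.
base = model.score(X_te, y_te)
drop = []
for j in range(X.shape[1]):
    Xj_tr = np.delete(X_tr, j, axis=1)
    Xj_te = np.delete(X_te, j, axis=1)
    m = GradientBoostingClassifier(random_state=0).fit(Xj_tr, y_tr)
    drop.append(base - m.score(Xj_te, y_te))

print(perm.importances_mean)  # one mean score drop per feature
print(drop)                   # one retrain-based score drop per feature
```

Drop-column importance requires one retraining per feature, which is why permutation importance is the cheaper choice when many rules (features) are involved.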

