dc.rights.license: CC-BY-NC-ND
dc.contributor.advisor: Hage, J.
dc.contributor.advisor: Feelders, A.J.
dc.contributor.author: Heres, D.
dc.date.accessioned: 2017-10-20T17:01:11Z
dc.date.available: 2017-10-20T17:01:11Z
dc.date.issued: 2017
dc.identifier.uri: https://studenttheses.uu.nl/handle/20.500.12932/27904
dc.description.abstract: In this thesis we study how we can apply machine learning techniques to improve source code plagiarism detection. We present a system, InfiniteMonkey, that can identify suspicious similarities between source code documents using two methods. For fast retrieval of source code similarities, we use a system based on $n$-gram features, tf-idf weighting and cosine similarity. The second part focuses on applying more complex neural network models trained on a large synthetic source code plagiarism dataset to classify source code plagiarism. This dataset is created using an automatic refactoring system we developed for learning this task. The methods are evaluated and compared to other tools on a number of different datasets. We show that the traditional approach compares well against other approaches, while the deep model on synthetic data does not generalize well to the evaluation tasks. In this thesis we also show a simple technique for visualization of source code similarities.
dc.description.sponsorship: Utrecht University
dc.format.extent: 697174
dc.format.mimetype: application/pdf
dc.language.iso: en_US
dc.title: Source Code Plagiarism Detection using Machine Learning
dc.type.content: Master Thesis
dc.rights.accessrights: Open Access
dc.subject.keywords: plagiarism detection, source code similarity, machine learning
dc.subject.courseuu: Computing Science
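
The abstract names $n$-gram features, tf-idf weighting and cosine similarity as the basis of the fast retrieval method. The following is a minimal sketch of that general technique, not the InfiniteMonkey implementation. It assumes character trigrams and a smoothed idf, and the function names and sample snippets are hypothetical; the record does not say how the thesis tokenizes code or which weighting variant it uses.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams (assumption: characters, not tokens)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def tfidf_vectors(docs, n=3):
    """Map each document to a sparse {ngram: tf * idf} dictionary."""
    counts = [char_ngrams(d, n) for d in docs]
    # Document frequency: number of documents that contain each n-gram.
    df = Counter(g for c in counts for g in c)
    # Smoothed idf (an assumed variant): n-grams found in every document
    # still keep a small positive weight.
    idf = {g: math.log((1 + len(docs)) / (1 + k)) + 1.0 for g, k in df.items()}
    return [{g: tf * idf[g] for g, tf in c.items()} for c in counts]


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dictionaries."""
    dot = sum(w * v.get(g, 0.0) for g, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical sample inputs: two renamed variants of one function and an unrelated line.
docs = [
    "int sum(int a, int b) { return a + b; }",
    "int add(int x, int y) { return x + y; }",
    "print('hello world')",
]
vectors = tfidf_vectors(docs)
sim_renamed = cosine(vectors[0], vectors[1])
sim_unrelated = cosine(vectors[0], vectors[2])
```

In this sketch the renamed pair scores higher than the unrelated pair, because most trigrams survive an identifier rename. A retrieval system would rank all document pairs by this score and pass the top candidates on for closer inspection.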

