Source Code Plagiarism Detection using Machine Learning

Heres, D.

dc.rights.license	CC-BY-NC-ND
dc.contributor.advisor	Hage, J.
dc.contributor.advisor	Feelders, A.J.
dc.contributor.author	Heres, D.
dc.date.accessioned	2017-10-20T17:01:11Z
dc.date.available	2017-10-20T17:01:11Z
dc.date.issued	2017
dc.identifier.uri	https://studenttheses.uu.nl/handle/20.500.12932/27904
dc.description.abstract	In this thesis we study how we can apply machine learning techniques to improve source code plagiarism detection. We present a system, InfiniteMonkey, that can identify suspicious similarities between source code documents using two methods. For fast retrieval of source code similarities, we use a system based on $n$-gram features, tf-idf weighting and cosine similarity. The second part focuses on applying more complex neural network models trained on a large synthetic source code plagiarism dataset to classify source code plagiarism. This dataset is created using an automatic refactoring system we developed for learning this task. The methods are evaluated and compared to other tools on a number of different datasets. We show that the traditional approach compares well against other approaches, while the deep model on synthetic data does not generalize well to the evaluation tasks. In this thesis we also show a simple technique for visualization of source code similarities.
dc.description.sponsorship	Utrecht University
dc.format.extent	697174
dc.format.mimetype	application/pdf
dc.language.iso	en_US
dc.title	Source Code Plagiarism Detection using Machine Learning
dc.type.content	Master Thesis
dc.rights.accessrights	Open Access
dc.subject.keywords	plagiarism detection, source code similarity, machine learning
dc.subject.courseuu	Computing Science
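
The fast retrieval method named in the abstract ($n$-gram features, tf-idf weighting, cosine similarity) can be illustrated with a minimal sketch. This is not the InfiniteMonkey implementation. The character-level trigrams, the smoothed idf formula, and the function names `char_ngrams`, `tfidf_vectors` and `cosine` are all assumptions made for illustration.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of text (assumed feature type)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tfidf_vectors(docs, n=3):
    """Map each document to a sparse {n-gram: tf-idf weight} dictionary."""
    counts = [Counter(char_ngrams(doc, n)) for doc in docs]

    # Document frequency: in how many documents does each n-gram occur?
    df = Counter()
    for c in counts:
        df.update(c.keys())

    total = len(docs)
    # Smoothed idf (an assumption): n-grams found in every document
    # still receive a small positive weight.
    idf = {g: math.log((1 + total) / (1 + k)) + 1.0 for g, k in df.items()}

    return [{g: tf * idf[g] for g, tf in c.items()} for c in counts]


def cosine(a, b):
    """Cosine similarity between two sparse weight dictionaries."""
    dot = sum(w * b.get(g, 0.0) for g, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical submissions: the second is the first with renamed identifiers.
docs = [
    "int sum(int a, int b) { return a + b; }",
    "int add(int x, int y) { return x + y; }",
    "print('hello world')",
]
vectors = tfidf_vectors(docs)
sim_renamed = cosine(vectors[0], vectors[1])
sim_unrelated = cosine(vectors[0], vectors[2])
```

In this sketch the renamed copy scores higher against the original than the unrelated snippet does, which is the property a retrieval step would rely on when ranking suspicious pairs.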

Files in this item

Name: source-code-plagiarism.pdf
Size: 680.8 KB
Format: PDF


This item appears in the following Collection(s)

Theses
