Source Code Plagiarism Detection using Machine Learning
Summary
This thesis studies how machine learning techniques can improve source code plagiarism detection. We present a system, InfiniteMonkey, that identifies suspicious similarities between source code documents using two methods. The first, designed for fast retrieval of similar documents, is based on $n$-gram features, tf-idf weighting and cosine similarity. The second applies more complex neural network models, trained on a large synthetic source code plagiarism dataset, to classify source code plagiarism. This dataset is created using an automatic refactoring system that we developed for this purpose. Both methods are evaluated and compared to other tools on a number of different datasets. We show that the traditional approach compares well against other approaches, while the deep model trained on synthetic data does not generalize well to the evaluation tasks. We also present a simple technique for visualizing source code similarities.
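To make the first method concrete, the following is a minimal sketch of similarity scoring with $n$-gram features, tf-idf weighting and cosine similarity. It is not the InfiniteMonkey implementation: the crude regular-expression tokenizer, the choice of token trigrams, the smoothed idf formula, and the function names `token_ngrams`, `tfidf_vectors` and `cosine_similarity` are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def token_ngrams(source, n=3):
    """Split source code into tokens and return overlapping token n-grams.

    Assumption: a crude tokenizer (identifiers, integers, single symbols);
    a real system would likely use a language-aware lexer.
    """
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|\S", source)
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf_vectors(documents, n=3):
    """Return one sparse tf-idf vector (dict: n-gram -> weight) per document."""
    counts = [Counter(token_ngrams(doc, n)) for doc in documents]

    # Document frequency: in how many documents each n-gram occurs.
    doc_freq = Counter()
    for c in counts:
        doc_freq.update(c.keys())

    total = len(documents)
    vectors = []
    for c in counts:
        # Assumption: smoothed idf, so that n-grams occurring in every
        # document still receive a nonzero weight.
        vectors.append({
            gram: tf * (math.log((1 + total) / (1 + doc_freq[gram])) + 1.0)
            for gram, tf in c.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b[gram] for gram, weight in a.items() if gram in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Hypothetical toy submissions, not taken from any evaluation dataset.
    original = "def add(a, b):\n    return a + b\n"
    edited = "def add(a, b):\n    result = a + b\n    return result\n"
    unrelated = "for i in range(10):\n    print(i)\n"

    vecs = tfidf_vectors([original, edited, unrelated])
    print("original vs edited:   ", cosine_similarity(vecs[0], vecs[1]))
    print("original vs unrelated:", cosine_similarity(vecs[0], vecs[2]))
```

In this sketch, a lightly edited copy shares most of its token trigrams with the original and therefore scores higher than an unrelated program. Because each document is reduced to a sparse vector once, candidate pairs can be ranked by cosine similarity without running a more expensive pairwise comparison on every pair.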