The Detection of AI-Generated Coding Content
Summary
The rise of large language models (LLMs) has significantly increased productivity and reduced workloads by enabling the automatic generation of software code from user prompts. However, the indiscriminate use of AI to generate coding content has also raised concerns, such as the generation of malicious software, cheating in programming education, and the infringement of intellectual property.
These issues call for robust methods to automatically detect Artificial Intelligence Generated Coding Content (AIGCC). This paper contributes to research on AIGCC detection by exploring classical detection methods as well as transformer-based detection methods built on CodeBERT. These methods were evaluated on the tasks of binary classification and author detection, using solutions to the Automated Programming Progress Standard (APPS) benchmark generated by open-source LLMs.
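As an illustration of the classical detection methods mentioned above, the following sketch trains a binary classifier on character n-gram TF-IDF features with logistic regression. This is an assumption-laden toy example, not the paper's exact pipeline: the feature choice, classifier, and the tiny code samples below are all invented for demonstration.

```python
# Hedged sketch of a classical AIGCC binary detector.
# NOT the paper's pipeline: features, classifier, and samples are illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented toy samples; label 0 = human-written, 1 = AI-generated.
human = [
    "for i in range(10): print(i)  # quick loop",
    "x=5\nif x>3: print('big')",
    "def add(a,b): return a+b",
]
ai = [
    'def add_numbers(a: int, b: int) -> int:\n    """Return the sum."""\n    return a + b',
    'def print_range(n: int) -> None:\n    """Print integers from 0 to n-1."""\n    for i in range(n):\n        print(i)',
    'def is_big(x: int) -> bool:\n    """Check whether x exceeds the threshold."""\n    return x > 3',
]
samples = human + ai
labels = [0] * len(human) + [1] * len(ai)

# Character n-grams capture stylistic surface cues (spacing, naming,
# docstring habits) that differ between human and generated code.
detector = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
detector.fit(samples, labels)

# predict_proba gives a per-sample probability of being AI-generated.
probabilities = detector.predict_proba(samples)[:, 1]
```

In practice such a detector would be trained on a large labeled corpus; the toy data here only demonstrates the pipeline shape.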
All of the detection methods surpass the previous state of the art, DetectGPT4Code, on binary classification, albeit with reduced generalizability to out-of-distribution data. We also show that natural language comments are important to detector performance, and that specialized detectors outperform more general ones. A fine-tuned CodeBERT model can additionally identify the author of a sample with reasonable performance.
These results indicate the potential utility of these detection methods in various applications, though their deployment should be considered carefully.