Performance of LLM-written text detectors across domains and under adversarial attack
Summary
Large Language Models (LLMs) have greatly improved the diversity and quality of machine-generated text, to the point that humans score at chance when distinguishing human-written texts from LLM-generated texts. Associated risks include accelerating phishing, disinformation, fraudulent product reviews, academic dishonesty, and spam. Detecting LLM-generated text could prove crucial in mitigating these risks. Many detectors have been proposed; however, past work has mainly focused on building detectors within a single domain, on the output of a single LLM. The most performant detector appears to be a fine-tuned masked Language Model (LM) with a classification head, but such detectors struggle with several issues: lack of interpretability, difficulty generalizing to unseen domains, and lack of robustness to adversarial attacks. This study sheds light on the performance and robustness of various LLM-generated text detectors across 10 different domains, and investigates whether robustness can be improved through data augmentation. We provide interpretable baselines for each domain, as well as a comparison between a fine-tuned LM trained on all domain data and an in-domain fine-tuned LM. We first show that a fine-tuned LM detector trained on multiple domains indeed has trouble generalizing to an unseen domain. We then show that the performance of various detectors varies between domains: in some domains a detector trained on all domains performs better, while in others fine-tuning within the domain is better. We then attack detectors in different domains with a character-level attack and a paraphrasing attack, and show that the robustness of the models varies by domain. We finally show that our fine-tuned LM detector trained on student-written essays can be made robust to character-level attacks through data augmentation, most effectively by adding paraphrases to the training data.
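To make the attack-and-augmentation setup concrete, the sketch below shows one plausible instantiation: a homoglyph-style character substitution as the character-level attack, and a simple routine that adds perturbed copies of LLM-generated texts to a detector's training set. The function names, substitution table, and perturbation rate are illustrative assumptions, not the exact procedure used in this study.

```python
import random

# Assumed look-alike substitutions (Latin -> Cyrillic); a real attack may use
# a different character set or perturbation strategy.
HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e", "i": "\u0456"}


def perturb(text: str, rate: float = 0.05, seed: int = 0) -> str:
    """Replace a fraction of eligible characters with visually similar homoglyphs."""
    rng = random.Random(seed)
    chars = list(text)
    for idx, ch in enumerate(chars):
        if ch.lower() in HOMOGLYPHS and rng.random() < rate:
            chars[idx] = HOMOGLYPHS[ch.lower()]
    return "".join(chars)


def augment(texts: list[str], rate: float = 0.05) -> list[str]:
    """Augment a training set with character-level perturbed copies,
    so a fine-tuned LM detector sees attacked inputs during training."""
    return texts + [perturb(t, rate, seed=i) for i, t in enumerate(texts)]


if __name__ == "__main__":
    sample = "This essay examines the causes of the industrial revolution."
    print(perturb(sample, rate=0.2))
```

The same idea applies to the paraphrasing-based augmentation reported as most effective: replace `perturb` with a paraphrasing model and append the paraphrased texts to the training data.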