
dc.rights.license: CC-BY-NC-ND
dc.contributor.advisor: Gatt, A.
dc.contributor.author: Lockhorst, Sjors
dc.date.accessioned: 2024-07-24T23:06:53Z
dc.date.available: 2024-07-24T23:06:53Z
dc.date.issued: 2024
dc.identifier.uri: https://studenttheses.uu.nl/handle/20.500.12932/46899
dc.description.abstract: Large Language Models (LLMs) have greatly improved the diversity and quality of machine-generated text, so much so that humans score at chance when distinguishing human-written texts from LLM-generated texts. Associated risks include accelerating phishing, disinformation, fraudulent product reviews, academic dishonesty, and spam. Detecting LLM-generated text could prove crucial in mitigating these risks. Many detectors have been proposed; however, past work has mainly focused on building detectors within one domain, on the output of one LLM. The most performant detector appears to be a fine-tuned masked Language Model (LM) with a classification head, but these detectors struggle with several issues, such as a lack of interpretability, difficulty generalizing to unseen domains, and a lack of robustness to adversarial attacks. This study sheds light on the performance and robustness of various LLM-generated text detectors across 10 different domains, and investigates whether robustness can be improved through data augmentation. We provide interpretable baselines for each domain, as well as a comparison between a fine-tuned LM trained on all domain data and an in-domain fine-tuned LM. We first show that a fine-tuned LM detector trained on multiple domains indeed has trouble generalizing to an unseen domain. We then show that the performance of various detectors varies between domains: in some domains a detector trained on all domains performs better, while in others fine-tuning within the domain is better. We then attack detectors in different domains with a character-level attack and a paraphrasing attack, and show that the robustness of the models varies with the domain. We finally show that our fine-tuned LM detector trained on student-written essays can be made robust to character-level attacks through data augmentation, most effectively by adding paraphrases to the training data.
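
The record contains no implementation, but the abstract names a character-level attack as one of the two adversarial attacks studied. As a minimal sketch of what such an attack might look like, the Python snippet below perturbs a small fraction of character positions with adjacent swaps and Latin-to-Cyrillic homoglyph substitutions; the perturbation rate and the substitution table are illustrative assumptions, not details taken from the thesis.

    import random

    def char_level_attack(text: str, rate: float = 0.05, seed: int = 0) -> str:
        """Perturb roughly `rate` of character positions with an adjacent swap
        or a homoglyph substitution, keeping the text readable to humans."""
        rng = random.Random(seed)
        # Illustrative Latin -> Cyrillic look-alike table (an assumption,
        # not the substitution table used in the thesis).
        homoglyphs = {"a": "\u0430", "e": "\u0435", "o": "\u043e", "c": "\u0441"}
        chars = list(text)
        i = 0
        while i < len(chars) - 1:
            if rng.random() < rate:
                if chars[i] in homoglyphs and rng.random() < 0.5:
                    chars[i] = homoglyphs[chars[i]]  # visually similar substitution
                else:
                    # swap with the next character, then skip past it
                    chars[i], chars[i + 1] = chars[i + 1], chars[i]
                    i += 1
            i += 1
        return "".join(chars)

    print(char_level_attack("Large Language Models have greatly improved machine-generated text."))

Such perturbations are nearly invisible to a human reader but change the token sequence a fine-tuned LM detector sees, which is consistent with the abstract's finding that augmenting the training data (most effectively with paraphrases) hardens the detector against this attack.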
dc.description.sponsorship: Utrecht University
dc.language.iso: EN
dc.subject: The evaluation of the performance of LLM-written text detectors across domains and under adversarial attacks. It evaluates how good different models are at detecting whether a text was written by an LLM or by a human. It also tries to assess how easily one can 'fool' such detection models through various adversarial attacks.
dc.title: Performance of LLM-written text detectors across domains and under adversarial attack
dc.type.content: Master Thesis
dc.rights.accessrights: Open Access
dc.subject.courseuu: Artificial Intelligence
dc.thesis.id: 34825

