Automatic Detection of Linguistic Errors in Dutch LLM-Generated Text

Heijboer, Kevin

dc.rights.license	CC-BY-NC-ND
dc.contributor.advisor	Bylinina, Lisa
dc.contributor.author	Heijboer, Kevin
dc.date.accessioned	2025-08-21T00:02:41Z
dc.date.available	2025-08-21T00:02:41Z
dc.date.issued	2025
dc.identifier.uri	https://studenttheses.uu.nl/handle/20.500.12932/49832
dc.description.abstract	As large language models (LLMs) are increasingly used to generate Dutch book descriptions, ensuring the linguistic quality of their output remains a challenge. This thesis explores whether real human edits can be used to train models that automatically detect linguistically unacceptable sentences. Using versioned summaries from Bookarang, a multi-step filtering pipeline was developed to extract only meaning-preserving linguistic edits, removing content and stylistic changes using sentence alignment, NLI filtering, and GPT-based classification. The result was a dataset of 12,894 labeled sentences for training acceptability classifiers. Transformer models were fine-tuned on this data, with Multilingual BERT achieving 74.3% recall, greatly outperforming a CoLA-NL-trained RobBERT baseline. Threshold tuning further made it possible to balance error detection against editorial workload. These results show that edit data can be turned into useful training material through targeted filtering, offering a practical approach to improving quality control for LLM-generated content in real-world editorial settings.
dc.description.sponsorship	Utrecht University
dc.language.iso	EN
dc.subject	This master's thesis explores how real-world human edits of Dutch book descriptions can be used to automatically detect linguistic errors in LLM-generated text. By creating a dataset based on Bookarang's editing process and training transformer models on it, the research demonstrates that models trained on actual editing data are much more effective at classifying linguistic acceptability than models trained on traditional datasets like CoLA-NL.
dc.title	Automatic Detection of Linguistic Errors in Dutch LLM-Generated Text
dc.type.content	Master Thesis
dc.rights.accessrights	Open Access
dc.subject.keywords	Linguistic acceptability;LLM-generated text;Dutch NLP;Text quality control;BERT;Automatic error detection;Edit-based training data;Natural language inference;CoLA-NL;Book descriptions
dc.subject.courseuu	Applied Data Science
dc.thesis.id	52078

Files in this item

Name: ads_masters_thesis_kevin_heijb ...
Size: 521.3Kb
Format: PDF


This item appears in the following Collection(s)

Theses
