
dc.rights.license: CC-BY-NC-ND
dc.contributor.advisor: Bylinina, Lisa
dc.contributor.author: Heijboer, Kevin
dc.date.accessioned: 2025-08-21T00:02:41Z
dc.date.available: 2025-08-21T00:02:41Z
dc.date.issued: 2025
dc.identifier.uri: https://studenttheses.uu.nl/handle/20.500.12932/49832
dc.description.abstract: As large language models (LLMs) are increasingly used to generate Dutch book descriptions, ensuring the linguistic quality of their output remains a challenge. This thesis explores whether real human edits can be used to train models that automatically detect linguistically unacceptable sentences. Using versioned summaries from Bookarang, a multi-step filtering pipeline was developed to extract only meaning-preserving linguistic edits, removing content and stylistic changes using sentence alignment, NLI filtering, and GPT-based classification. The result was a dataset of 12,894 labeled sentences for training acceptability classifiers. Transformer models were fine-tuned on this data, with Multilingual BERT achieving 74.3% recall, greatly outperforming a CoLA-NL-trained RobBERT baseline. Threshold tuning further allowed balancing error detection with editorial workload. These results show that edit data can be turned into useful training material through targeted filtering, offering a practical approach to improving quality control for LLM-generated content in real-world editorial settings.
dc.description.sponsorship: Utrecht University
dc.language.iso: EN
dc.subject: This master's thesis explores how real-world human edits of Dutch book descriptions can be used to automatically detect linguistic errors in LLM-generated text. By creating a dataset based on Bookarang's editing process and training transformer models, the research demonstrates that models trained on actual editing data are much more effective at classifying linguistic acceptability than traditional datasets like CoLA-NL.
dc.title: Automatic Detection of Linguistic Errors in Dutch LLM-Generated Text
dc.type.content: Master Thesis
dc.rights.accessrights: Open Access
dc.subject.keywords: Linguistic acceptability; LLM-generated text; Dutch NLP; Text quality control; BERT; Automatic error detection; Edit-based training data; Natural language inference; CoLA-NL; Book descriptions
dc.subject.courseuu: Applied Data Science
dc.thesis.id: 52078

