dc.rights.license | CC-BY-NC-ND | |
dc.contributor.advisor | Bylinina, Lisa | |
dc.contributor.author | Heijboer, Kevin | |
dc.date.accessioned | 2025-08-21T00:02:41Z | |
dc.date.available | 2025-08-21T00:02:41Z | |
dc.date.issued | 2025 | |
dc.identifier.uri | https://studenttheses.uu.nl/handle/20.500.12932/49832 | |
dc.description.abstract | As large language models (LLMs) are increasingly used to generate Dutch book descriptions, ensuring the linguistic quality of their output remains a challenge. This thesis explores whether real human edits can be used to train models that automatically detect linguistically unacceptable sentences. Using versioned summaries from Bookarang, a multi-step filtering pipeline was developed to extract only meaning-preserving linguistic edits, removing content and stylistic changes using sentence alignment, NLI filtering, and GPT-based classification. The result was a dataset of 12,894 labeled sentences for training acceptability classifiers. Transformer models were fine-tuned on this data, with Multilingual BERT achieving 74.3% recall, greatly outperforming a CoLA-NL-trained RobBERT baseline. Threshold tuning further made it possible to balance error detection against editorial workload. These results show that edit data can be turned into useful training material through targeted filtering, offering a practical approach to improving quality control for LLM-generated content in real-world editorial settings. | |
dc.description.sponsorship | Utrecht University | |
dc.language.iso | EN | |
dc.subject | This master's thesis explores how real-world human edits of Dutch book descriptions can be used to automatically detect linguistic errors in LLM-generated text. By creating a dataset based on Bookarang's editing process and training transformer models, the research demonstrates that models trained on actual editing data are much more effective at classifying linguistic acceptability than models trained on traditional datasets like CoLA-NL. | |
dc.title | Automatic Detection of Linguistic Errors in Dutch LLM-Generated Text | |
dc.type.content | Master Thesis | |
dc.rights.accessrights | Open Access | |
dc.subject.keywords | Linguistic acceptability;LLM-generated text;Dutch NLP;Text quality control;BERT;Automatic error detection;Edit-based training data;Natural language inference;CoLA-NL;Book descriptions | |
dc.subject.courseuu | Applied Data Science | |
dc.thesis.id | 52078 | |