Automatic Detection of Linguistic Errors in Dutch LLM-Generated Text
Summary
As large language models (LLMs) are increasingly used to generate Dutch book descriptions, ensuring the linguistic quality of their output remains a challenge. This thesis explores whether real human edits can be used to train models that automatically detect linguistically unacceptable sentences. Using versioned summaries from Bookarang, a multi-step filtering pipeline was developed to extract only meaning-preserving linguistic edits, removing content and stylistic changes through sentence alignment, NLI filtering, and GPT-based classification. The result was a dataset of 12,894 labeled sentences for training acceptability classifiers. Transformer models were fine-tuned on this data, with Multilingual BERT achieving 74.3% recall and greatly outperforming a CoLA-NL-trained RobBERT baseline. Threshold tuning further allowed error detection to be balanced against editorial workload. These results show that edit data can be turned into useful training material through targeted filtering, offering a practical approach to improving quality control for LLM-generated content in real-world editorial settings.
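The threshold tuning mentioned above amounts to sweeping the classifier's decision threshold and observing how recall on unacceptable sentences trades off against the share of sentences flagged for human review. The sketch below illustrates this idea on placeholder data; the scores, labels, and threshold values are invented for illustration and are not taken from the thesis or its dataset.

```python
# Illustrative sketch of threshold tuning for an acceptability classifier.
# Placeholder data only: 1 = linguistically unacceptable, 0 = acceptable.
import numpy as np

rng = np.random.default_rng(0)

y_true = rng.integers(0, 2, size=1_000)
# Fake classifier scores, loosely correlated with the labels.
y_score = np.clip(y_true * 0.4 + rng.random(1_000) * 0.6, 0.0, 1.0)

for threshold in (0.3, 0.5, 0.7):
    flagged = y_score >= threshold
    # Recall: fraction of unacceptable sentences that get flagged.
    recall = (flagged & (y_true == 1)).sum() / max((y_true == 1).sum(), 1)
    # Editorial workload: fraction of all sentences sent to a reviewer.
    workload = flagged.mean()
    print(f"threshold={threshold:.1f}  recall={recall:.2f}  "
          f"flagged for review={workload:.0%}")
```

Lowering the threshold raises recall but increases the reviewing burden, which is the trade-off an editorial team would tune to its capacity.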