Show simple item record

dc.rights.license: CC-BY-NC-ND
dc.contributor.advisor: Zervanou, Kalliopi
dc.contributor.author: Berk, Rens
dc.date.accessioned: 2025-08-28T00:03:12Z
dc.date.available: 2025-08-28T00:03:12Z
dc.date.issued: 2025
dc.identifier.uri: https://studenttheses.uu.nl/handle/20.500.12932/50062
dc.description.abstract: The Employee Insurance Agency (Uitvoeringsinstituut Werknemersverzekeringen, UWV) is the Dutch public body responsible for administering employee insurance schemes, unemployment benefits, and labour-market services. This study investigates whether Natural Language Processing (NLP) can automate measuring the content similarity between the UWV's reconciliation validations and the paragraphs of the Payroll Tax Handbook, which is published by the Dutch Tax and Customs Administration (Belastingdienst). The dataset includes 64 reconciliation validations and 6,290 paragraphs. Three Bidirectional Encoder Representations from Transformers (BERT) models were benchmarked against a Term Frequency–Inverse Document Frequency (TF-IDF) model, which served as the baseline. The TF-IDF model achieved the highest F1-score, at a cosine similarity threshold of 0.40, owing to substantial lexical overlap between the two sources. Of the BERT models, the Sentence-BERT (SBERT) model performed best, but it still lagged behind the lexical approach (TF-IDF). The study's strengths include its openly released dataset, which can support more comprehensive follow-up work. Its limitations include the absence of metadata features and the lack of validation by legal experts. The findings suggest that simple lexical methods remain highly competitive in specialised legal and administrative domains, and that future work should explore domain-specific pre-training, ensemble rankers, and the evaluation of top-k candidates to improve semantic alignment. More broadly, the research demonstrates how pragmatic NLP can bridge the gap between regulatory texts and administrative practice, offering a scalable template for the digitalisation of rule-based public-sector workflows.
dc.description.sponsorship: Utrecht University
dc.language.iso: EN
dc.subject: This thesis explores using NLP to automate similarity matching between UWV reconciliation validations and the Dutch Payroll Tax Handbook. Benchmarking BERT models against a TF-IDF baseline, the study finds that TF-IDF performs best due to strong lexical overlap. The results highlight the effectiveness of simple methods in legal domains and suggest future work on domain-specific pre-training and ensemble ranking for improved semantic alignment.
dc.title: Identifying Inconsistencies Between Internal Manuals and Official Regulations Using Transformer-Based Text Analysis
dc.type.content: Master Thesis
dc.rights.accessrights: Open Access
dc.subject.courseuu: Applied Data Science
dc.thesis.id: 52714
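
The abstract above describes matching reconciliation validations to handbook paragraphs by TF-IDF cosine similarity with a threshold of 0.40. The following is a minimal sketch of that general technique only, not the thesis's actual implementation: the whitespace tokenizer, the smoothed IDF formula, the function names and the example sentences are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase + whitespace split; a real pipeline would likely do more.
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (a dict of term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents in which each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Assumption: smoothed IDF, so terms found in every document keep a non-zero weight.
        vectors.append(
            {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, c in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either vector is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def match(validations, paragraphs, threshold=0.40):
    """Return (validation index, paragraph index, score) for pairs at or above the threshold."""
    vectors = tfidf_vectors(validations + paragraphs)
    val_vecs = vectors[:len(validations)]
    par_vecs = vectors[len(validations):]
    pairs = []
    for i, u in enumerate(val_vecs):
        for j, v in enumerate(par_vecs):
            score = cosine(u, v)
            if score >= threshold:
                pairs.append((i, j, score))
    return pairs


# Hypothetical example texts, not taken from the thesis dataset.
validations = ["wage amount must match payroll tax return"]
paragraphs = [
    "wage amount must match payroll tax return",
    "holiday allowance is paid in may",
]
matches = match(validations, paragraphs)
```

Here the first paragraph repeats the validation text and is returned as a match, while the second shares no terms with it and falls below the threshold.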


