Identifying Inconsistencies Between Internal Manuals and Official Regulations Using Transformer-Based Text Analysis

Berk, Rens

dc.rights.license	CC-BY-NC-ND
dc.contributor.advisor	Zervanou, Kalliopi
dc.contributor.author	Berk, Rens
dc.date.accessioned	2025-08-28T00:03:12Z
dc.date.available	2025-08-28T00:03:12Z
dc.date.issued	2025
dc.identifier.uri	https://studenttheses.uu.nl/handle/20.500.12932/50062
dc.description.abstract	The Employee Insurance Agency (‘Uitvoeringsinstituut Werknemersverzekeringen’, UWV) is the Dutch public body responsible for administering employee insurance schemes, unemployment benefits, and labour-market services. This study investigates whether Natural Language Processing (NLP) could automate the process of measuring content similarity between the UWV’s reconciliation validations and the paragraphs in the Payroll Tax Handbook, which is published by the Dutch Tax and Customs Administration (‘Belastingdienst’). The dataset includes 64 reconciliation validations and 6,290 paragraphs. Three different Bidirectional Encoder Representations from Transformers (BERT) models were benchmarked against a Term Frequency–Inverse Document Frequency (TF-IDF) model, which was used as the baseline. The TF-IDF model achieved the highest F1-score, at a cosine similarity threshold of 0.40, due to substantial lexical overlap between the two documents. Of the BERT models, the Sentence-BERT (SBERT) model performed best, but still lagged behind the lexical approach (TF-IDF). The study’s strengths include its openly released dataset, which can be used for a more comprehensive approach. Its limitations include the absence of metadata features and the lack of validation by legal experts. The findings suggest that simple lexical methods remain highly competitive in specialised legal and administrative domains, and that future work should explore domain-specific pre-training, ensemble rankers, and top-k candidate evaluation to improve semantic alignment. More broadly, the research demonstrates how pragmatic NLP can bridge the gap between regulatory texts and administrative practice, offering a scalable template for the digitalisation of rule-based public-sector workflows.
dc.description.sponsorship	Utrecht University
dc.language.iso	EN
dc.subject	This thesis explores using NLP to automate similarity matching between UWV reconciliation validations and the Dutch Payroll Tax Handbook. Benchmarking BERT models against a TF-IDF baseline, the study finds TF-IDF performs best due to strong lexical overlap. Results highlight the effectiveness of simple methods in legal domains and suggest future work in domain-specific pre-training and ensemble ranking for improved semantic alignment.
dc.title	Identifying Inconsistencies Between Internal Manuals and Official Regulations Using Transformer-Based Text Analysis
dc.type.content	Master Thesis
dc.rights.accessrights	Open Access
dc.subject.courseuu	Applied Data Science
dc.thesis.id	52714
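
The abstract describes matching reconciliation validations to handbook paragraphs by thresholding TF-IDF cosine similarity at 0.40. The following is a minimal sketch of that general technique, not the thesis's implementation: the tokenizer, the smoothed IDF weighting, the function names (tokenize, tfidf_vectors, cosine, match) and the example sentences are all assumptions made for illustration. Only the 0.40 threshold is taken from the abstract.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase alphanumeric tokens (an assumed, deliberately simple tokenizer).
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (term -> weight dict) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF; an assumed variant, the thesis's weighting is not stated.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    return [
        {t: count * idf[t] for t, count in Counter(tokens).items()}
        for tokens in tokenized
    ]


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def match(validations, paragraphs, threshold=0.40):
    """Return (validation_index, paragraph_index, score) for pairs >= threshold."""
    # Fit IDF on both collections together so the vectors share one vocabulary.
    vectors = tfidf_vectors(validations + paragraphs)
    v_vecs = vectors[: len(validations)]
    p_vecs = vectors[len(validations):]
    return [
        (i, j, score)
        for i, v in enumerate(v_vecs)
        for j, p in enumerate(p_vecs)
        if (score := cosine(v, p)) >= threshold
    ]


# Hypothetical example texts, not taken from the thesis dataset.
validations = ["wage tax withheld must equal declared wage tax"]
paragraphs = [
    "the wage tax withheld must equal the declared wage tax amount",
    "holiday allowance is paid in may",
]
matches = match(validations, paragraphs)
```

With these toy inputs, only the first paragraph shares vocabulary with the validation, so it is the only candidate that can pass the threshold. With 64 validations and 6,290 paragraphs, a production setup would more likely use a vectorised library than this nested loop.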

Files in this item

Name: Thesis Rens Berk 9192212 ADS UWV ...
Size: 473.9Kb
Format: PDF

This item appears in the following Collection(s)

Theses
