Generative AI for AI-driven data management: Name similarity with transformers for entity matching

Zając, Ola

dc.rights.license	CC-BY-NC-ND
dc.contributor.advisor	Siebes, Arno
dc.contributor.author	Zając, Ola
dc.date.accessioned	2025-09-03T23:02:05Z
dc.date.available	2025-09-03T23:02:05Z
dc.date.issued	2025
dc.identifier.uri	https://studenttheses.uu.nl/handle/20.500.12932/50307
dc.description.abstract	Organization name matching is essential for data integration but presents unique challenges due to inconsistent formatting, abbreviations, typographical and phonetic variations, and the lack of grammatical structure or contextual cues. Traditional string-based methods perform well on predictable name patterns but struggle with semantic variation. This thesis explores the use of transformer-based embeddings, particularly Sentence Transformers, for organization name similarity. I compare a string similarity baseline (Levenshtein), a pretrained Sentence Transformer (MiniLM), and several fine-tuned variants across two real-world datasets (GLEIF, JRC-Names) and a set of synthetically generated name variants. Results show that transformer embeddings generalize well to complex name differences, with fine-tuning further improving performance. I also demonstrate that vector search using a vector database (Qdrant) significantly improves retrieval speed without sacrificing accuracy. The findings suggest that transformer-based approaches, especially when combined with scalable retrieval infrastructure, offer a robust and efficient solution for organization name matching tasks in real-world settings.
dc.description.sponsorship	Utrecht University
dc.language.iso	EN
dc.title	Generative AI for AI-driven data management: Name similarity with transformers for entity matching
dc.type.content	Master Thesis
dc.rights.accessrights	Open Access
dc.subject.keywords	Organization name matching; entity matching; text similarity; string similarity; Levenshtein distance; Sentence Transformers; transformer embeddings; MiniLM; fine-tuning; contrastive learning; vector search; vector databases; semantic similarity; information retrieval; data integration; scalable retrieval
dc.subject.courseuu	Applied Data Science
dc.thesis.id	53520

Files in this item

Name: MasterThesis_AOZajac.pdf
Size: 1.241Mb
Format: PDF

This item appears in the following Collection(s)

Theses
