dc.rights.license | CC-BY-NC-ND | |
dc.contributor.advisor | Siebes, Arno | |
dc.contributor.author | Zając, Ola | |
dc.date.accessioned | 2025-09-03T23:02:05Z | |
dc.date.available | 2025-09-03T23:02:05Z | |
dc.date.issued | 2025 | |
dc.identifier.uri | https://studenttheses.uu.nl/handle/20.500.12932/50307 | |
dc.description.abstract | Organization name matching is essential for data integration but presents unique challenges
due to inconsistent formatting, abbreviations, typographical and phonetic variations, and the
lack of grammatical structure or contextual cues. Traditional string-based methods perform
well on predictable name patterns but struggle with semantic variation. This thesis explores
the use of transformer-based embeddings, particularly Sentence Transformers, for
organization name similarity. I compare a string similarity baseline (Levenshtein), a
pretrained Sentence Transformer (MiniLM), and several fine-tuned variants across two
real-world datasets (GLEIF, JRC-Names) and a set of synthetically generated name variants.
Results show that transformer embeddings generalize well to complex name differences, with
fine-tuning further improving performance. I also demonstrate that vector search using a
vector database (Qdrant) significantly improves retrieval speed without sacrificing accuracy.
The findings suggest that transformer-based approaches, especially when combined with
scalable retrieval infrastructure, offer a robust and efficient solution for organization name
matching tasks in real-world settings. | |
dc.description.sponsorship | Utrecht University | |
dc.language.iso | EN | |
dc.title | Generative AI for AI-driven data management: Name similarity with transformers for entity matching | |
dc.type.content | Master Thesis | |
dc.rights.accessrights | Open Access | |
dc.subject.keywords | Organization name matching; entity matching; text similarity; string similarity; Levenshtein distance; Sentence Transformers; transformer embeddings; MiniLM; fine-tuning; contrastive learning; vector search; vector databases; semantic similarity; information retrieval; data integration; scalable retrieval | |
dc.subject.courseuu | Applied Data Science | |
dc.thesis.id | 53520 | |