
dc.rights.license: CC-BY-NC-ND
dc.contributor.advisor: Siebes, Arno
dc.contributor.author: Zając, Ola
dc.date.accessioned: 2025-09-03T23:02:05Z
dc.date.available: 2025-09-03T23:02:05Z
dc.date.issued: 2025
dc.identifier.uri: https://studenttheses.uu.nl/handle/20.500.12932/50307
dc.description.abstract: Organization name matching is essential for data integration but presents unique challenges due to inconsistent formatting, abbreviations, typographical and phonetic variations, and the lack of grammatical structure or contextual cues. Traditional string-based methods perform well on predictable name patterns but struggle with semantic variation. This thesis explores the use of transformer-based embeddings, particularly Sentence Transformers, for organization name similarity. I compare a string similarity baseline (Levenshtein), a pretrained Sentence Transformer (MiniLM), and several fine-tuned variants across two real-world datasets (GLEIF, JRC-Names) and a set of synthetically generated name variants. Results show that transformer embeddings generalize well to complex name differences, with fine-tuning further improving performance. I also demonstrate that vector search using a vector database (Qdrant) significantly improves retrieval speed without sacrificing accuracy. The findings suggest that transformer-based approaches, especially when combined with scalable retrieval infrastructure, offer a robust and efficient solution for organization name matching tasks in real-world settings.
dc.description.sponsorship: Utrecht University
dc.language.iso: EN
dc.subject: Organization name matching is essential for data integration but presents unique challenges due to inconsistent formatting, abbreviations, typographical and phonetic variations, and the lack of grammatical structure or contextual cues. Traditional string-based methods perform well on predictable name patterns but struggle with semantic variation. This thesis explores the use of transformer-based embeddings, particularly Sentence Transformers, for organization name similarity.
dc.title: Generative AI for AI-driven data management: Name similarity with transformers for entity matching
dc.type.content: Master Thesis
dc.rights.accessrights: Open Access
dc.subject.keywords: Organization name matching; entity matching; text similarity; string similarity; Levenshtein distance; Sentence Transformers; transformer embeddings; MiniLM; fine-tuning; contrastive learning; vector search; vector databases; semantic similarity; information retrieval; data integration; scalable retrieval
dc.subject.courseuu: Applied Data Science
dc.thesis.id: 53520

