Utrecht University Student Theses Repository
        Generative AI for AI-driven data management: Name similarity with transformers for entity matching

        MasterThesis_AOZajac.pdf (1.241Mb)
        Publication date
        2025
        Author
        Zając, Ola
        Summary
        Organization name matching is essential for data integration but presents unique challenges due to inconsistent formatting, abbreviations, typographical and phonetic variations, and the lack of grammatical structure or contextual cues. Traditional string-based methods perform well on predictable name patterns but struggle with semantic variation. This thesis explores the use of transformer-based embeddings, particularly Sentence Transformers, for organization name similarity. I compare a string similarity baseline (Levenshtein), a pretrained Sentence Transformer (MiniLM), and several fine-tuned variants across two real-world datasets (GLEIF, JRC-Names) and a set of synthetically generated name variants. Results show that transformer embeddings generalize well to complex name differences, with fine-tuning further improving performance. I also demonstrate that vector search using a vector database (Qdrant) significantly improves retrieval speed without sacrificing accuracy. The findings suggest that transformer-based approaches, especially when combined with scalable retrieval infrastructure, offer a robust and efficient solution for organization name matching tasks in real-world settings.
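To make the string-similarity baseline mentioned in the summary concrete, here is a minimal sketch of a length-normalized Levenshtein similarity in plain Python. The summary does not say how the thesis normalizes or preprocesses names, so the case folding, the division by the longer length, the function names (`levenshtein`, `name_similarity`), and the example name pairs are all assumptions made for illustration. They are not taken from GLEIF, JRC-Names, or the thesis itself.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance computed with a two-row dynamic programme."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string on the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def name_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; 1.0 means identical after normalization.

    Assumption: names are case-folded and stripped, and the distance is
    divided by the longer length. The thesis may normalize differently.
    """
    a, b = a.casefold().strip(), b.casefold().strip()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


# Illustrative pairs (hypothetical, not from the thesis datasets).
typo = name_similarity("Utrecht University", "Utrecht Univeristy")
abbreviation = name_similarity("International Business Machines", "IBM")
print(f"typo pair: {typo:.2f}")
print(f"abbreviation pair: {abbreviation:.2f}")
```

Under these assumptions the typo pair scores about 0.89, while the abbreviation pair scores about 0.10 even though both names refer to the same organization. This is the kind of variation the summary says string-based methods struggle with, and where an embedding-based comparison could be expected to help.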
        URI
        https://studenttheses.uu.nl/handle/20.500.12932/50307
        Collections
        • Theses