        Utrecht University Student Theses Repository


        Keyed Alike: Towards Versatile Domain-Specific Keyword Extraction with BERT

        File
        THESIS.pdf (3.815Mb)
        Publication date
        2022
        Author
        Pozzi, Lorenzo
        Summary
        Automatic Keyword Extraction involves the identification of representative terms in a passage of interest. A set of keywords is highly effective in various applications, including topic modeling, semantic search, information retrieval, and text summarization. Past research has shown the potential of Transformer models for the task. However, such architectures are limited by their need for labeled datasets, which require time and effort to annotate. The present research seeks to overcome this limitation by proposing an alternative annotation approach for Automatic Keyword Extraction corpora, generalizable to diverse domains at reduced cost and production time. A model based on Bidirectional Encoder Representations from Transformers (BERT) is then fine-tuned to extract domain-specific terms from the generated dataset. The experiments aim to validate the designed annotation procedure and to shed light on BERT's capability to recognize relevant terms in domain-specific documents. This thesis also analyzes the word space generated by BERT in order to study the effect of fine-tuning on Automatic Keyword Extraction. The results show that the proposed solution for dataset annotation was effective and that the implemented BERT-based model outperformed the baselines on all the proposed tasks. Moreover, the final analysis indicates that BERT's word space exhibits semantic coherence, since the generated embeddings are arranged according to their relatedness to the target domain.
        URI
        https://studenttheses.uu.nl/handle/20.500.12932/43073
        Collections
        • Theses