Keyed Alike: Towards Versatile Domain-Specific Keyword Extraction with BERT
Summary
Automatic Keyword Extraction involves the identification of representative terms in a passage of interest. A set of keywords is highly effective in various applications, including topic modeling, semantic search, information retrieval, and text summarization. Past research has shown the potential of Transformer models for the task; however, such architectures require labeled datasets whose annotation demands considerable time and effort. The present research seeks to overcome this limitation by proposing an alternative annotation approach for Automatic Keyword Extraction corpora, generalizable to diverse domains at reduced cost and production time. A model based on Bidirectional Encoder Representations from Transformers (BERT) is then fine-tuned on the generated dataset to extract domain-specific terms. The experiments aim to validate the designed annotation procedure and shed light on BERT's ability to recognize relevant terms in domain-specific documents. This thesis also presents an analysis of the word space generated by BERT in order to study the effect of fine-tuning on Automatic Keyword Extraction. The results show that the proposed solution for dataset annotation is effective and that the implemented BERT-based model outperforms the baselines on all the proposed tasks. Moreover, the final analysis indicates that BERT's word space exhibits semantic coherence, since the generated embeddings are arranged according to their relatedness to the target domain.
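For illustration only, keyword extraction with BERT is commonly framed as token classification, where each token is tagged as beginning, inside, or outside a keyphrase. The sketch below uses the Hugging Face transformers library; the model checkpoint, B/I/O label scheme, and example sentence are assumptions for this sketch and do not necessarily reflect the exact configuration used in the thesis.

# Minimal sketch: keyword extraction as token classification with BERT.
# Checkpoint, label scheme, and example text are illustrative assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-KEY", "I-KEY"]  # outside / begin / inside a keyphrase
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased", num_labels=len(labels)
)

text = "Automatic Keyword Extraction identifies representative terms in a document."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits          # shape: (1, seq_len, num_labels)
pred_ids = logits.argmax(dim=-1)[0].tolist()

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, pred in zip(tokens, pred_ids):
    print(token, labels[pred])               # predicted tag per WordPiece token

Note that the classification head above is randomly initialized, so its predictions are meaningless until the model is fine-tuned on an annotated corpus such as the one produced by the proposed annotation procedure.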