        •   Utrecht University Student Theses Repository Home
        The importance of domain-specific expertise in training customized Named Entity Recognition models

        View/Open
        Bachelor_Thesis_Aniek_Brandt_6253458.pdf (518.5Kb)
        Publication date
        2021
        Author
        Brandt, A.
        Summary
        The Dutch Expertise Center of Human Trafficking and Human Smuggling aims to extract relevant articles from online media archives. One approach is to create a custom Named Entity Recognition model. Named Entity Recognition (NER) is a subtask of Information Extraction (IE). Its goal is to extract certain ‘named entities’ from unstructured text. These entities used to be limited to proper names, but today NER encompasses the extraction of all important entities within a given context [17]. When creating a custom NER model, the entities to be extracted are by definition very domain-specific. Because of this, large annotated training corpora usually do not exist for custom NER models, and annotation is done by hand. This raises the question of whether annotation should be done by people with knowledge of the given domain or by people with knowledge of NER. In this report, a custom NER model created with the spaCy library is trained on a dataset annotated either by a fourth-year AI student or by employees of the Expertise Center, in order to assess the importance of domain-specific knowledge in annotating data for custom NER models. Different properties of the annotated datasets are analyzed, as well as the performance of the models. The models trained on the dataset annotated by the AI student slightly outperformed those trained on the dataset annotated by the Expertise Center, but not by a great margin. Above all, the outcome of the research suggests a trade-off between extracting certain extremely specific entities and creating a model that performs and generalizes well.
        URI
        https://studenttheses.uu.nl/handle/20.500.12932/41157
        Collections
        • Theses