The importance of domain-specific expertise in training customized Named Entity Recognition models
Summary
The Dutch Expertise Center of Human Trafficking and Human Smuggling aims to mine online media archives for articles that are relevant to its work. One approach to this is creating a custom Named Entity Recognition model.
Named Entity Recognition (NER) is a subtask of Information Extraction (IE). Its goal is to extract certain ‘named entities’ from unstructured text. These entities were originally limited to proper names, but today NER encompasses the extraction of all important entities within a given context [17].
When creating a custom NER model, the entities that are extracted are by definition highly domain-specific. Because of this, large annotated training corpora usually do not exist for custom NER models, and annotation is done by hand. This raises the question of whether annotation should be done by people with knowledge of the given domain, or by people with knowledge of NER.
In this report, a custom NER model built with the spaCy library is trained on a dataset annotated either by a fourth-year AI student or by employees of the Expertise Center, in order to assess the importance of domain-specific knowledge in annotating data for custom NER models. Different properties of the annotated datasets are analyzed, as well as the performance of the resulting models.
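To illustrate what such hand-annotated training data looks like, the sketch below uses spaCy's character-offset annotation format, in which each example pairs a raw text with entity spans given as (start, end, label) tuples. The example sentences and the entity labels (LOCATION, RECRUITMENT_METHOD) are hypothetical stand-ins for the domain-specific labels an annotator might use; they are not taken from the actual datasets described in this report.

```python
# Hand-annotated examples in spaCy's offset-based training format:
# (text, {"entities": [(start_char, end_char, label), ...]}).
# Sentences and labels below are illustrative, not from the real dataset.
TRAIN_DATA = [
    ("Police dismantled a smuggling ring in Rotterdam.",
     {"entities": [(38, 47, "LOCATION")]}),
    ("Three victims were recruited via a job advertisement.",
     {"entities": [(35, 52, "RECRUITMENT_METHOD")]}),
]

def check_offsets(dataset):
    """Verify that every annotated span slices a non-empty substring.

    Off-by-one character offsets are a common hand-annotation error;
    spaCy silently drops spans that do not align with token boundaries,
    so validating offsets before training avoids losing examples.
    """
    for text, annotations in dataset:
        for start, end, label in annotations["entities"]:
            span = text[start:end]
            if not span.strip():
                raise ValueError(f"Empty span for label {label!r} in {text!r}")
    return True

check_offsets(TRAIN_DATA)
```

In spaCy v3 these tuples would then be converted to `Example` objects (via `Example.from_dict`) and fed to the training loop; the offset format itself is the part both annotator groups would have produced by hand.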
The models trained on the dataset annotated by the AI student slightly outperformed those trained on the dataset annotated by the Expertise Center. Above all, the outcome of the research suggests a trade-off between extracting highly specific entities and creating a model that performs and generalizes well.