Evaluation of state-of-the-art machine learning approaches on the detection of variations for entity mentions

Bawi, M. Al

dc.rights.license	CC-BY-NC-ND
dc.contributor.advisor	Schraagen, Dr. M
dc.contributor.advisor	van Deemter, Prof. dr. C.J.
dc.contributor.author	Bawi, M. Al
dc.date.accessioned	2020-12-18T19:00:13Z
dc.date.available	2020-12-18T19:00:13Z
dc.date.issued	2020
dc.identifier.uri	https://studenttheses.uu.nl/handle/20.500.12932/38367
dc.description.abstract	Natural Language Processing is concerned with the interactions between computers and natural languages. Named Entity Recognition is an important subject within the field of Natural Language Processing. Its main goal is finding entities within texts, which can be done in numerous ways. Another important task in Natural Language Processing is Entity Linkage, in which unique entities are assigned to entity mentions in texts. Although research on Natural Language Processing has come a long way, we still experience issues recognizing and linking entities in modern-day texts. Modern-day texts, such as those used on social media, are often short, which provides limited context and makes it harder to find and link entities. Not only do modern-day texts lack context, but they are also written by various authors, which gives rise to the entity name variation problem. Having many possible mention variations with little contextual information is why short texts are often seen as troublesome for many recognition and linking tasks. In this research, we will look at name variations for names in short texts. Semitic names often have a large number of variations when transliterated to Latin script, which is why we will look at some Semitic names written in English to test our hypotheses. To link entity mention variations, we need to be able to classify a tweet based on the entity that is mentioned. One way to classify tweets based on mentions of an entity, regardless of variation, is to look at the context of an entity. In our research, we will investigate three methods that can be used to correctly link entities in the presence of name variations found in short texts. First, we will examine how well we can link entities in the presence of name variation using a Logistic Regression classifier.
Then we will try to link entity name variations by classification using a Convolutional Neural Network. Finally, we will look at Topic Modeling as an approach to cluster our short texts, allowing us to group variations of entity names. We test our models on various tasks in which we test the influence of the number of entities, the number of variations, and the available context. For our research, we will be using different types of datasets. Our first dataset contains 10 unique entities with names of Semitic origin, found in 3874 unique tweets. A second dataset contains 753 tweets with 2 entities that have the same entity mention, accompanied by a group of unknown entities with the same entity mention. Our research shows that, depending on the number of entities and training knowledge, we can choose the model that fits best for each situation. With a higher number of entities, Convolutional Neural Network models seem to perform better than the other models. With a low number of entities, we can show that Topic Modeling might become a better alternative for classifying entities. We also look at how our models behave when trying to differentiate between entities with the same name, for example, tweets containing the same entity mention "Muhammad" whilst referring to different entities. Other research on entity linking is often based on different types of data or different datasets, which means that it is not suitable for comparison with our results. To get an indication of how well our models work, we will compare our models with baseline models that we created. Our research showed that Topic Modeling had promising results when differentiating entities with the same name.
dc.description.sponsorship	Utrecht University
dc.format.extent	1848711
dc.format.mimetype	application/pdf
dc.language.iso	en
dc.title	Evaluation of state-of-the-art machine learning approaches on the detection of variations for entity mentions
dc.type.content	Master Thesis
dc.rights.accessrights	Open Access
dc.subject.keywords	Artificial Intelligence, Language, Natural Language Processing, Entity Linking, Entity Classification, Twitter, Arabic, Name variation, Topic Modeling, Logistic Regression, Convolutional Neural Networks, Levenshtein
dc.subject.courseuu	Artificial Intelligence

Files in this item

Name: Mustafa_AI_Thesis (13).pdf
Size: 1.763Mb
Format: PDF

This item appears in the following Collection(s)

Theses
