Show simple item record

dc.rights.license: CC-BY-NC-ND
dc.contributor.advisor: Externe beoordelaar - External assessor
dc.contributor.author: Scholten, Niels
dc.date.accessioned: 2024-05-08T23:01:43Z
dc.date.available: 2024-05-08T23:01:43Z
dc.date.issued: 2024
dc.identifier.uri: https://studenttheses.uu.nl/handle/20.500.12932/46384
dc.description.abstract: Topic modeling is a method for generating prevalent themes in large collections of natural language documents. Recently, representations of documents as a distribution of topics have been used as features for text classification. The classification can then be explained based on topics in the document, which is helpful in high-stakes decision-making. However, most topic modeling techniques do not consider the classification variable, which could result in topics that are not useful as features for classification. Previous studies addressed this problem by incorporating classification labels in the topic modeling process. These "supervised" topic models exhibit significant classification performance gains. Recently, there has been a trend of finding topics in text documents by clustering in pretrained word embedding spaces. This method helps advance the handling of ambiguous and context-dependent words. However, no topic model exists that clusters in word embedding space while incorporating classification labels. This paper introduces sTopClus: a supervised topic model based on TopClus [1], which clusters BERT embeddings to find topics effective for text classification. A comparative study between sTopClus and other topic models showed that topics generated by sTopClus were more suitable for linear text classification. However, automatic topic coherence metrics were inconclusive, and the qualitative analysis concluded that sTopClus topics were more difficult to interpret. Further investigation showed that sTopClus suffered from a misoptimization problem, which severely hampered its interpretability. This paper thoroughly documents the misoptimization problem and attempts to alleviate it. Furthermore, the core architectural choices for developing a supervised topic model that clusters contextualized word embeddings are analyzed. Lastly, the implications of this research for supervised topic modeling using contextualized word embeddings are discussed.
dc.description.sponsorship: Utrecht University
dc.language.iso: EN
dc.subject: This research aims to find topics in text documents that reflect the documents' content and can also serve as useful features for classification. The model described in this paper combines multiple loss functions to find latent word and document embeddings that form coherent clusters.
dc.title: Interpretable Text Classification through Topic Modeling by Clustering in Word Embedding Spaces
dc.type.content: Master Thesis
dc.rights.accessrights: Open Access
dc.subject.keywords: Topic Modelling, Word embeddings, Clustering
dc.subject.courseuu: Artificial Intelligence
dc.thesis.id: 30684
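
The abstract above describes the general technique of clustering contextualized word embeddings into topics and using a document's topic proportions as features for a linear classifier. The Python sketch below illustrates that baseline idea only; it is not the sTopClus model. The BERT checkpoint, the number of clusters, and the toy documents and labels are illustrative assumptions, and plain K-means stands in for the supervised clustering objective developed in the thesis.

# Minimal sketch (not sTopClus): cluster contextualized BERT token embeddings
# into "topics", represent each document as its distribution over those
# clusters, and train a linear classifier on the topic proportions.
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Toy data (illustrative assumption, not from the thesis).
docs = ["the bank approved the loan", "the river bank was flooded",
        "interest rates rose again", "fish swam near the shore"]
labels = [1, 0, 1, 0]  # toy classification variable

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

enc = tokenizer(docs, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    hidden = encoder(**enc).last_hidden_state          # (docs, tokens, dim)

# Collect every real (non-padding) token embedding, remembering its document.
mask = enc["attention_mask"].bool()
token_vecs, token_doc = [], []
for d in range(hidden.size(0)):
    vecs = hidden[d][mask[d]].numpy()
    token_vecs.append(vecs)
    token_doc.extend([d] * len(vecs))
token_vecs = np.vstack(token_vecs)

# Cluster token embeddings into K "topics" (unsupervised stand-in).
K = 3
topics = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(token_vecs)

# Document feature = normalized histogram of its tokens' topic assignments.
features = np.zeros((len(docs), K))
for doc_id, topic_id in zip(token_doc, topics):
    features[doc_id, topic_id] += 1
features /= features.sum(axis=1, keepdims=True)

# Linear classifier on the topic proportions.
clf = LogisticRegression().fit(features, labels)
print(clf.predict(features))

The histogram-of-cluster-assignments features correspond to the "document as a distribution of topics" representation mentioned in the abstract; sTopClus instead learns the clustering jointly with the classification labels rather than applying K-means after the fact.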

