Show simple item record

dc.rights.license: CC-BY-NC-ND
dc.contributor.advisor: Externe beoordelaar - External assessor
dc.contributor.author: Scholten, Niels
dc.date.accessioned: 2024-05-08T23:01:43Z
dc.date.available: 2024-05-08T23:01:43Z
dc.date.issued: 2024
dc.identifier.uri: https://studenttheses.uu.nl/handle/20.500.12932/46384
dc.description.abstract: Topic modeling is a method for generating prevalent themes in large collections of natural language documents. Recently, representations of documents as a distribution of topics have been used as features for text classification. The classification can then be explained based on topics in the document, which is helpful in high-stakes decision-making. However, most topic modeling techniques do not consider the classification variable, which could result in topics that are not useful as features for classification. Previous studies addressed this problem by incorporating classification labels in the topic modeling process. These "supervised" topic models exhibit significant classification performance gains. Recently, there has been a trend of finding topics in text documents by clustering in pretrained word embedding spaces. This method helps advance the handling of ambiguous and context-dependent words. However, no topic model exists that clusters in word embedding space while incorporating classification labels. This paper introduces sTopClus: a supervised topic model based on TopClus [1], which clusters BERT embeddings to find topics effective for text classification. A comparative study between sTopClus and other topic models showed that topics generated by sTopClus were more suitable for linear text classification. However, automatic topic coherence metrics were inconclusive, and the qualitative analysis concluded that sTopClus topics were more difficult to interpret. Further investigation showed that sTopClus suffered from a misoptimization problem, which severely hampered its interpretability. This paper thoroughly documents the misoptimization problem and attempts to alleviate it. Furthermore, the core architectural choices for developing a supervised topic model that clusters contextualized word embeddings are analyzed. Lastly, the implications of this research for supervised topic modeling using contextualized word embeddings are discussed.
dc.description.sponsorship: Utrecht University
dc.language.iso: EN
dc.subject: This research aims to find topics in text documents that reflect the documents' content and can also serve as useful features for classification. The model described in this paper combines multiple loss functions to find latent word and document embeddings that form coherent clusters.
dc.title: Interpretable Text Classification through Topic Modeling by Clustering in Word Embedding Spaces
dc.type.content: Master Thesis
dc.rights.accessrights: Open Access
dc.subject.keywords: Topic Modelling, Word embeddings, Clustering
dc.subject.courseuu: Artificial Intelligence
dc.thesis.id: 30684
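
The abstract above describes the general technique of clustering contextualized word embeddings into topics and using a document's topic proportions as features for a linear classifier. The Python sketch below illustrates that baseline idea only; it is not the sTopClus model. The BERT checkpoint, the number of clusters, and the toy documents and labels are illustrative assumptions, and plain K-means stands in for the supervised clustering objective developed in the thesis.

# Minimal sketch (not sTopClus): cluster contextualized BERT token embeddings
# into "topics", represent each document as its distribution over those
# clusters, and train a linear classifier on the topic proportions.
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Toy data (illustrative assumption, not from the thesis).
docs = ["the bank approved the loan", "the river bank was flooded",
        "interest rates rose again", "fish swam near the shore"]
labels = [1, 0, 1, 0]  # toy classification variable

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

enc = tokenizer(docs, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    hidden = encoder(**enc).last_hidden_state          # (docs, tokens, dim)

# Collect every real (non-padding) token embedding, remembering its document.
mask = enc["attention_mask"].bool()
token_vecs, token_doc = [], []
for d in range(hidden.size(0)):
    vecs = hidden[d][mask[d]].numpy()
    token_vecs.append(vecs)
    token_doc.extend([d] * len(vecs))
token_vecs = np.vstack(token_vecs)

# Cluster token embeddings into K "topics" (unsupervised stand-in).
K = 3
topics = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(token_vecs)

# Document feature = normalized histogram of its tokens' topic assignments.
features = np.zeros((len(docs), K))
for doc_id, topic_id in zip(token_doc, topics):
    features[doc_id, topic_id] += 1
features /= features.sum(axis=1, keepdims=True)

# Linear classifier on the topic proportions.
clf = LogisticRegression().fit(features, labels)
print(clf.predict(features))

The histogram-of-cluster-assignments features correspond to the "document as a distribution of topics" representation mentioned in the abstract; sTopClus instead learns the clustering jointly with the classification labels rather than applying K-means after the fact.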

