        Interpretable Text Classification through Topic Modeling by Clustering in Word Embedding Spaces

        Thesis_Niels_Scholten_Final_Version.pdf (2.272Mb)
        Publication date
        2024
        Author
        Scholten, Niels
        Summary
        Topic modeling is a method for discovering prevalent themes in large collections of natural language documents. Recently, representations of documents as distributions over topics have been used as features for text classification. The classification can then be explained in terms of the topics present in a document, which is helpful in high-stakes decision-making. However, most topic modeling techniques do not consider the classification variable, which can result in topics that are not useful as features for classification. Previous studies addressed this problem by incorporating classification labels into the topic modeling process. These "supervised" topic models exhibit significant gains in classification performance. Recently, there has been a trend of finding topics in text documents by clustering in pretrained word embedding spaces. This method improves the handling of ambiguous and context-dependent words. However, no existing topic model clusters in word embedding space while also incorporating classification labels. This paper introduces sTopClus, a supervised topic model based on TopClus [1] that clusters BERT embeddings to find topics effective for text classification. A comparative study between sTopClus and other topic models showed that topics generated by sTopClus were more suitable for linear text classification. However, while automatic topic coherence metrics were inconclusive, the qualitative analysis concluded that sTopClus topics were more difficult to interpret. Further investigation showed that sTopClus suffered from a misoptimization problem, which severely hampered its interpretability. This paper thoroughly documents the misoptimization problem and attempts to alleviate it. Furthermore, it analyzes the core architectural choices involved in developing a supervised topic model that clusters in contextualized word embedding spaces. Lastly, the implications of this research for supervised topic modeling using contextualized word embeddings are discussed.
        URI
        https://studenttheses.uu.nl/handle/20.500.12932/46384
        Collections
        • Theses