
dc.rights.license: CC-BY-NC-ND
dc.contributor.advisor: Mosteiro Romero, Pablo
dc.contributor.author: Apallius de Vos, Isa
dc.date.accessioned: 2023-03-30T10:00:42Z
dc.date.available: 2023-03-30T10:00:42Z
dc.date.issued: 2023
dc.identifier.uri: https://studenttheses.uu.nl/handle/20.500.12932/43733
dc.description.abstract: Text classification is the process of categorising documents into groups based on features of their content. Previous research on this topic has focused on testing how specific data attributes such as data size, class distribution, or document length influence the performance of certain types of classifier, but none has reported on classifier performance when a dataset has several of these attributes at once. This thesis therefore investigates how a dataset with multiple limiting attributes influences different classification models, and which type of model works best on such a dataset. To do this, a multi-label dataset of 403 labelled Dutch letters of objection with 24 different labels was created, after which three different simple models (Decision Tree, Naive Bayes, and SVM) and a pre-trained language model (BERTje) were trained on the dataset. Testing the models and comparing micro and macro F1-scores showed that the language model did not outperform the simpler models on the text classification task. The language of the texts and the class distribution in the dataset were not shown to greatly influence the models' performance, whereas the small data size was found to be the main data attribute limiting the performance of all models. Interestingly, some classifiers obtained high F1-scores for some of the very small classes in the dataset, indicating that the documents do contain information on those subjects that could easily be extracted by a model if the right technique were used. It is thus proposed to further test the models on individual classes and to inspect a different, rule-based approach to classification to see whether model performance can be improved on this classification task.
dc.description.sponsorship: Utrecht University
dc.language.iso: EN
dc.subject: The thesis looked into the performance of different types of models on a text classification task, focusing on how different data attributes influence model choice and performance and how the limitations of different model types can be overcome.
dc.title: Text Classification of a Small and Imbalanced Dataset With Long Texts
dc.type.content: Master Thesis
dc.rights.accessrights: Open Access
dc.subject.keywords: Text classification; Subject classification; Text mining; NLP; BERT
dc.subject.courseuu: Artificial Intelligence
dc.thesis.id: 15433

