dc.description.abstract | Text classification is the process of categorising documents into groups based on features of their content. Previous research on this topic has tested how individual data attributes such as dataset size, class distribution, or document length affect the performance of particular types of classifiers, but no prior work has reported how classifiers perform on a dataset that combines several of these attributes at once. This thesis therefore investigates how a dataset with multiple limiting attributes influences different classification models, and which type of model performs best under such conditions. To this end, a multi-label dataset of 403 labelled Dutch letters of objection with 24 distinct labels was created, and three simple models (Decision Tree, Naive Bayes, and SVM) as well as a pre-trained language model (BERTje) were trained on it. After testing the models and comparing micro and macro F1-scores, it was found that the language model could not outperform the simpler models on this text classification task. The language of the texts and the class distribution in the dataset were not shown to greatly influence model performance, whereas the small dataset size proved to be the main attribute limiting the performance of all models. Interestingly, some classifiers obtained high F1-scores on some of the very small classes, indicating that the documents do contain information on those subjects that a model could extract if the right technique were used. It is therefore proposed to further test the models on individual classes and to explore a rule-based approach to classification to see whether performance on this task can be improved. | |