Text Classification of a Small and Imbalanced Dataset With Long Texts

Apallius de Vos, Isa

Publication date

2023

Summary

Text classification is the process of categorising documents into groups based on features of their content. Previous research on this topic has tested how individual data attributes, such as data size, class distribution, or document length, influence the performance of particular types of classifier, but no research has reported on classifier performance when a dataset has several of these attributes at once. This thesis therefore investigates how a dataset with multiple limiting attributes influences different classification models, and which type of model works best on such a dataset. To do this, a multi-label dataset of 403 labelled Dutch letters of objection with 24 different labels was created, after which three simple models (Decision Tree, Naive Bayes, and SVM) and a pre-trained language model (BERTje) were trained on the dataset. Testing the models and comparing their micro and macro F1-scores showed that the language model could not outperform the simpler models on the text classification task. The language of the texts and the class distribution in the dataset were not shown to greatly influence the models' performance, whereas the small data size was found to be the main data attribute limiting the performance of all models. Interestingly, some classifiers obtained high F1-scores for some of the very small classes in the dataset, indicating that the documents do contain information on those subjects that a model could easily extract if the right technique were used. It is thus proposed to test the models further on individual classes and to investigate a different, rule-based approach to classification to see whether model performance on this task can be improved.
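The difference between the micro and macro F1-scores mentioned in the summary matters most on imbalanced multi-label data, so a minimal sketch may help. The label matrices below are invented toy values, not data from the thesis, and the function names (`per_label_counts`, `f1`, `micro_macro_f1`) are hypothetical; the thesis itself may well have used a library implementation instead.

```python
from typing import List, Tuple

Matrix = List[List[int]]  # rows = documents, columns = labels (1 = label assigned)


def per_label_counts(y_true: Matrix, y_pred: Matrix) -> List[Tuple[int, int, int]]:
    """Return (true positives, false positives, false negatives) for each label."""
    n_labels = len(y_true[0])
    counts = []
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        counts.append((tp, fp, fn))
    return counts


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true: Matrix, y_pred: Matrix) -> Tuple[float, float]:
    counts = per_label_counts(y_true, y_pred)
    # Macro: compute F1 per label, then average, so every label weighs the same.
    macro = sum(f1(*c) for c in counts) / len(counts)
    # Micro: pool the counts over all labels first, so frequent labels dominate.
    micro = f1(
        sum(c[0] for c in counts),
        sum(c[1] for c in counts),
        sum(c[2] for c in counts),
    )
    return micro, macro


# Toy example (assumed values): label 0 is frequent, labels 1 and 2 are rare.
y_true = [[1, 0, 0], [1, 0, 0], [1, 1, 0], [1, 0, 0], [0, 1, 1], [1, 0, 0]]
y_pred = [[1, 0, 0], [1, 0, 0], [1, 0, 0], [1, 0, 0], [0, 1, 0], [1, 0, 0]]

micro, macro = micro_macro_f1(y_true, y_pred)
print(f"micro F1 = {micro:.3f}, macro F1 = {macro:.3f}")
# prints: micro F1 = 0.857, macro F1 = 0.556
```

In this toy case the frequent label is predicted perfectly while the rarest label is missed entirely, so the micro score stays high and the macro score drops. Reporting both scores, as the thesis does, makes that kind of gap between large and small classes visible.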

URI

https://studenttheses.uu.nl/handle/20.500.12932/43733

Collections

Theses