On the classification of imbalanced image datasets

Claessen, O.

dc.rights.license	CC-BY-NC-ND
dc.contributor.advisor	de Campos, C.P.
dc.contributor.advisor	Feelders, A.J.
dc.contributor.author	Claessen, O.
dc.date.accessioned	2019-08-21T17:00:32Z
dc.date.available	2019-08-21T17:00:32Z
dc.date.issued	2019
dc.identifier.uri	https://studenttheses.uu.nl/handle/20.500.12932/33513
dc.description.abstract	In complex real-world problems such as fraud detection and disaster prediction, some classes have far fewer instances than others, which makes the dataset imbalanced. In such domains, a class may have very few instances while still being very important for the classification task. An intuitive example is fraud detection: fraudulent transactions are much less common than non-fraudulent ones, so there is less data on which to train a classifier, yet fraud is the most important class to predict. Solving such problems therefore requires addressing dataset imbalance. The goal of this thesis is to construct a classification model that can predict the classes of image data in an imbalanced dataset. The dataset used for this research is the Common Objects in Context (COCO) dataset by Microsoft, which contains 80 classes, includes occluded and cluttered images of their instances, and is quite imbalanced. The thesis investigates which kinds of classifiers perform well on this type of imbalanced dataset by comparing a convolutional neural network (CNN), a random forest (RF) and a support vector machine (SVM). The RF and SVM classifiers use pre-extracted features such as histogram of oriented gradients (HOG) and DAISY. Of the three classifiers, the neural network generally performed better than the RF and SVM. This is probably because the RF and SVM depend on pre-extracted features, which makes them less flexible in extracting meaningful features from images. The neural network's micro-averaged F1 score (0.425) exceeded that of the baseline dummy classifier (0.306) while using only 10% of the entire COCO dataset.
This research shows that different approaches have the potential to produce models that can be used for multi-class classification on imbalanced datasets. The results suggest that the techniques used in this research can be fine-tuned and optimized further, which could lead to better results.
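As a rough illustration of the metric quoted in the abstract, the sketch below computes micro- and macro-averaged F1 for a majority-class baseline on a small imbalanced label set. The toy labels, the function names, and the choice of a "most frequent class" baseline are assumptions made for illustration only; the record does not state which dummy strategy or evaluation code the thesis used.

```python
from collections import Counter


def per_class_counts(y_true, y_pred, label):
    """Return (tp, fp, fn) for one class, treating it as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    return tp, fp, fn


def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool tp/fp/fn over all classes, then compute F1."""
    labels = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    for label in labels:
        c_tp, c_fp, c_fn = per_class_counts(y_true, y_pred, label)
        tp, fp, fn = tp + c_tp, fp + c_fp, fn + c_fn
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Macro-averaged F1: compute F1 per class, then take the unweighted mean."""
    labels = set(y_true) | set(y_pred)
    scores = []
    for label in labels:
        tp, fp, fn = per_class_counts(y_true, y_pred, label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical imbalanced toy labels (not COCO data): 6 / 3 / 1 split.
y_true = ["person"] * 6 + ["car"] * 3 + ["kite"] * 1

# Assumed baseline: always predict the most frequent class.
majority_label = Counter(y_true).most_common(1)[0][0]
baseline_pred = [majority_label] * len(y_true)

print("baseline micro F1:", micro_f1(y_true, baseline_pred))
print("baseline macro F1:", macro_f1(y_true, baseline_pred))
```

In this single-label setting, micro-averaged F1 equals accuracy, so a majority-class baseline scores exactly the share of the largest class. Macro-averaged F1 weights every class equally and therefore penalizes a model that ignores the minority classes, which is why a baseline comparison matters when reading scores on imbalanced data.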
dc.description.sponsorship	Utrecht University
dc.format.extent	2037797
dc.format.mimetype	application/pdf
dc.language.iso	en_US
dc.title	On the classification of imbalanced image datasets
dc.type.content	Master Thesis
dc.rights.accessrights	Open Access
dc.subject.keywords	Machine learning, Imbalanced datasets, Sampling, Multi-class classification, Image classification
dc.subject.courseuu	Artificial Intelligence
