On the classification of imbalanced image datasets

Claessen, O.

Publication date

2019

Summary

In complex real-world problems such as fraud detection and disaster prediction, some classes occur far less often than others, which makes the dataset imbalanced. A class can have very few instances and still be the most important one for the classification task. Fraud detection is an intuitive example: fraudulent transactions are much rarer than non-fraudulent ones, so a classifier has less data from which to learn them, yet fraud is exactly the class that matters most to predict. Solving such problems therefore requires addressing class imbalance directly. The goal of this thesis is to construct a classification model that can predict the classes of image data in an imbalanced dataset. The dataset used is Microsoft's Common Objects in Context (COCO) dataset, which contains 80 classes, shows their instances in occluded and cluttered images, and is quite imbalanced. The thesis investigates which kinds of classifiers perform well on such imbalanced datasets by comparing a convolutional neural network (CNN), a random forest (RF) and a support vector machine (SVM). The RF and SVM classifiers use pre-extracted features such as histogram of oriented gradients (HOG) and DAISY. Of the three classifiers, the neural network generally performed better than the RF and SVM. This is probably because the RF and SVM depend on pre-extracted features, which makes them less flexible in extracting meaningful features from images. The neural network reached a micro-averaged F1 score of 0.425, outperforming the baseline dummy classifier (0.306), while using only 10% of the entire COCO dataset.
This research shows that different approaches have the potential to yield models for multi-class classification tasks on imbalanced datasets. The results suggest that the techniques used here can be fine-tuned and optimized further, which could lead to better results.
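
The comparison between a micro-averaged F1 score and a dummy baseline can be illustrated with a minimal sketch. It is not the thesis code. It assumes single-label predictions and a dummy classifier that always predicts the most frequent training class. The summary does not state which dummy strategy was used, and the function names and toy labels below are hypothetical.

```python
from collections import Counter


def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP and FN over all classes, then compute F1."""
    labels = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    for label in labels:
        for truth, pred in zip(y_true, y_pred):
            if pred == label and truth == label:
                tp += 1
            elif pred == label:
                fp += 1  # predicted this class, but the truth differs
            elif truth == label:
                fn += 1  # missed an instance of this class
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def majority_baseline(y_train, n_samples):
    """Assumed dummy strategy: always predict the most frequent training class."""
    most_common_label = Counter(y_train).most_common(1)[0][0]
    return [most_common_label] * n_samples


# Toy imbalanced labels (illustrative only, not drawn from COCO statistics).
y_train = ["person"] * 6 + ["car"] * 3 + ["kite"] * 1
y_test = ["person"] * 6 + ["car"] * 3 + ["kite"] * 1

baseline_pred = majority_baseline(y_train, len(y_test))
baseline_score = micro_f1(y_test, baseline_pred)
print(f"baseline micro-F1: {baseline_score:.3f}")
```

On this toy split the baseline scores 0.6, which is simply the share of the majority class: in the single-label setting, micro-averaged F1 equals accuracy. A trivial predictor can therefore look strong on an imbalanced dataset, which is why the classifier scores are reported against a dummy baseline.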

URI

https://studenttheses.uu.nl/handle/20.500.12932/33513

Collections

Theses