Using Active Learning to Mitigate Sampling Bias: A Comparison of Two Algorithms
Summary
In machine learning we often have to work with imperfect data. One way in which data can be imperfect is through sampling bias: a biased sample does not accurately represent the population, which makes it difficult to train a well-generalizing, unbiased classifier and can cause a machine learning system to be unfair. Many techniques have been developed to address this challenge, ranging from pre-processing techniques, which mitigate bias in the data before training begins, to in-processing techniques, which alter the training process itself to reduce bias, to post-processing techniques, which adjust a trained model's outputs. In this thesis we compare two active learning algorithms. The first is an in-processing technique specifically designed to target sampling bias; we compare it to a simpler, more conventional active learning algorithm. In experiments on benchmark data, we evaluate how the two algorithms perform in terms of improving classifier performance, making the sample more closely resemble the population, and other criteria. We find that the conventional active learning algorithm actually outperforms the algorithm targeted at sampling bias on most criteria and in most settings. In addition, we run an experiment on biased toxicology data from the Rijksinstituut voor Volksgezondheid en Milieu (RIVM).
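To make the "conventional" approach concrete, the following is a minimal sketch of an uncertainty-sampling active learning loop, which is one common form such an algorithm takes. Everything here is illustrative and not taken from the thesis: the data is a hypothetical synthetic two-blob dataset, and a nearest-centroid classifier stands in for whatever model the thesis actually uses.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical synthetic data: two Gaussian blobs (class 0 and class 1).
X = np.vstack([rng.normal(-1.0, 1.0, (200, 2)), rng.normal(1.0, 1.0, (200, 2))])
y = np.array([0] * 200 + [1] * 200)

# Small seed set of labeled points (five per class); the rest form the unlabeled pool.
labeled = list(range(5)) + list(range(200, 205))
pool = [i for i in range(len(X)) if i not in labeled]

def centroids(idx):
    # Fit a nearest-centroid classifier on the currently labeled subset.
    return np.array([X[idx][y[idx] == c].mean(axis=0) for c in (0, 1)])

for _ in range(20):  # query budget of 20 labels
    c = centroids(labeled)
    # Distance from each pool point to each class centroid.
    d = np.linalg.norm(X[pool][:, None, :] - c[None, :, :], axis=2)
    # Uncertainty sampling: query the pool point whose two class distances
    # are most similar, i.e. the point closest to the decision boundary.
    query = pool[int(np.argmin(np.abs(d[:, 0] - d[:, 1])))]
    labeled.append(query)
    pool.remove(query)

c = centroids(labeled)
pred = np.argmin(np.linalg.norm(X[:, None, :] - c[None, :, :], axis=2), axis=1)
acc = (pred == y).mean()
print(f"labeled: {len(labeled)} points, accuracy: {acc:.2f}")
```

The loop repeatedly retrains on the labeled set, asks for the label of the most ambiguous remaining point, and folds the answer back in, so labeling effort concentrates near the decision boundary rather than being spread uniformly over the pool.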