Skewed Selective Acquisition: Sampling Bias in Active Learning and its Influence on Operational Classification Performance
Active learning aims to provide supervised learning models with highly informative and succinct data, but it can also introduce sampling bias, as the resulting labelled dataset often no longer follows the original data distribution. Due to sampling bias, the training data no longer represents the environment that the model will operate in, which can be either harmful or helpful to classifier performance. One form of sampling bias is class bias, where the labelled dataset no longer follows the class distribution of the population data. Through experiments on 15 different binary-classification datasets, this thesis studied active learning sampling bias through class bias and its relation to operational classification performance. First, this thesis investigated four main factors in the active learning cycle that might influence sampling bias in the labelled training data: the choice of active learning algorithm, the choice of machine learning classifier, the level of class imbalance in the unlabelled data pool, and the class ratio of the initial training set. All factors had an effect on sampling bias and on classifier performance, with varying degrees of severity. Of the four, the level of class imbalance had the largest influence on how much sampling bias the active learning algorithms introduced. Afterwards, various experiments were conducted using three active learning debiasing methods: hierarchical sampling, QUerying Informative and Representative Examples (QUIRE) and Active Learning By Learning (ALBL). These query strategies were compared in terms of trained classifier performance and sampling bias mitigation. In these experiments, informativeness-based query strategies such as uncertainty sampling led to the most sampling bias but also to the highest performance. While the three debiasing methods resulted in less sampling bias in the labelled dataset, they were generally outperformed by uncertainty sampling.
Of these methods, hierarchical sampling performed the best, achieving performance only marginally worse than that of uncertainty sampling. These results suggest that active learning algorithms that introduce sampling bias can boost performance, especially under high class imbalance. However, careful consideration is needed before adopting active learning algorithms that introduce more sampling bias. In cases where sampling bias in the training data is harmful, either by raising issues of fairness or by yielding a shortsighted classifier, a well-performing but more representativeness-based active learning method such as hierarchical sampling or density-weighted sampling is recommended.
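
As a rough illustration of class bias under uncertainty sampling, the sketch below selects the pool instances whose predicted positive-class probability lies closest to 0.5 and compares the class ratio of the queried batch with that of the pool. The `class_bias` metric, the function names and the toy data are assumptions made for this example; the abstract does not specify how the thesis itself quantifies class bias.

```python
from typing import List, Sequence


def positive_ratio(labels: Sequence[int]) -> float:
    """Fraction of labels equal to 1 in a binary label sequence."""
    if not labels:
        raise ValueError("labels must be non-empty")
    return sum(1 for y in labels if y == 1) / len(labels)


def class_bias(labelled: Sequence[int], population: Sequence[int]) -> float:
    """Signed gap between labelled-set and population positive ratios.

    Hypothetical metric for illustration only: 0 means the labelled set
    mirrors the population; positive values mean the positive class is
    over-represented in the labelled set.
    """
    return positive_ratio(labelled) - positive_ratio(population)


def uncertainty_query(probs: Sequence[float], k: int) -> List[int]:
    """Indices of the k instances whose predicted P(y=1) is closest to 0.5."""
    ranked = sorted(range(len(probs)), key=lambda i: abs(probs[i] - 0.5))
    return ranked[:k]


# Toy imbalanced pool: 2 positives out of 10 (assumed data, not from the thesis).
pool_labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
# Hypothetical classifier scores; the uncertain region sits near the positives.
pool_probs = [0.02, 0.05, 0.08, 0.10, 0.12, 0.15, 0.30, 0.45, 0.55, 0.60]

queried = uncertainty_query(pool_probs, k=4)
queried_labels = [pool_labels[i] for i in queried]
bias = class_bias(queried_labels, pool_labels)
```

On this toy pool the queried batch is half positive while the pool is 20% positive, so the minority class is over-represented in the labelled data. This is the kind of skew the abstract describes as potentially helpful under high class imbalance.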