COMPARISON OF DATA IMPUTATION METHODS PERFORMANCE FOR MULTIPLE SYSTEM ESTIMATION (CASE STUDY: HUMAN TRAFFICKING DATA IN THE NETHERLANDS 2016 - 2019)
Summary
Human trafficking is a problem that still occurs in the modern world, and it is necessary to monitor
the number of victims. Since human trafficking is a hidden crime, statistics on identified trafficking
victims only reveal a small part of the problem, and the actual number of victims can only be
estimated. UNODC recommends using Multiple Systems Estimation (MSE), whereby the size of a
hidden population of human trafficking victims is estimated by analyzing the overlap between
three or more administrative lists on which persons belonging to that population appear.
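The overlap principle behind MSE can be illustrated with its simplest two-list special case, the Lincoln-Petersen estimator. The sketch below is only an illustration under stated assumptions: the counts are invented, the function name `lincoln_petersen` is hypothetical, and the formula assumes independent lists. MSE itself works with three or more lists and is not reproduced here.

```python
def lincoln_petersen(n_list_a: int, n_list_b: int, n_both: int) -> float:
    """Estimate total population size from two overlapping lists.

    Assumes the two lists are independent and that every person has the
    same chance of appearing on a given list.
    """
    if n_both <= 0:
        raise ValueError("the lists must overlap to estimate the population")
    return n_list_a * n_list_b / n_both


# Invented counts, for illustration only (not the Dutch data).
estimate = lincoln_petersen(n_list_a=200, n_list_b=150, n_both=30)
observed = 200 + 150 - 30        # distinct persons seen on at least one list
hidden = estimate - observed     # persons estimated to appear on neither list
print(estimate, observed, hidden)
```

The smaller the overlap relative to the list sizes, the larger the estimated hidden population, which is why missing values in the list-membership records can shift the estimate.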
One of the main problems in implementing MSE is missing data. This problem is likely to occur
because MSE relies on registration data from several different external sources. Applying an
imputation method should make it possible to address missing data.
However, although this problem occurs frequently in MSE implementations, the literature reviewed
for this study contains no comparative study of imputation method performance based on the MSE
output. In the Netherlands, the human trafficking records for 2016 – 2019 also contain missing
data. Previous studies using the same data applied multiple imputation only with the default
method for binary and 2-level categorical data (i.e., logistic regression). Missing data reduce the
quality of population estimates, so a suitable imputation method must be chosen beforehand to
produce the best MSE output.
To address these problems, this study compared the performance of imputation methods based on
the MSE results in estimating the human trafficking population in the Netherlands from 2016 –
2019. The methods were first compared through the AIC and BIC values of the resulting MSE
models. The AIC and BIC versions of the models were then compared with each other on model
complexity, standard error, and reasonableness of the estimates. This study focuses on multiple
imputation with seven different methods: predictive mean matching (PMM), classification and
regression trees (CART), random forest, logistic regression, logistic regression with bootstrap,
lasso logistic regression, and linear discriminant analysis (LDA).
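The AIC and BIC comparison described above can be sketched with the standard definitions, AIC = 2k - 2 ln L and BIC = k ln n - 2 ln L. Everything numeric in the sketch is hypothetical: the labels `method_a` to `method_c`, the log-likelihoods, the parameter counts, and the sample size `n_obs` are invented for illustration and are not results from this study. What counts as n for an MSE model is also an assumption here.

```python
import math


def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike information criterion: lower is better."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood: float, n_params: int, n_obs: int) -> float:
    """Bayesian information criterion: lower is better."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


# Hypothetical model fits, one per imputed dataset:
# label -> (log-likelihood, number of parameters)
fits = {
    "method_a": (-120.0, 5),
    "method_b": (-118.5, 9),
    "method_c": (-125.0, 4),
}
n_obs = 64  # assumed sample size, illustration only

# Rank the labels from best (lowest score) to worst under each criterion.
rank_by_aic = sorted(fits, key=lambda m: aic(*fits[m]))
rank_by_bic = sorted(fits, key=lambda m: bic(*fits[m], n_obs))
print("AIC ranking:", rank_by_aic)
print("BIC ranking:", rank_by_bic)
```

Because BIC penalizes each additional parameter by ln n rather than 2, the two criteria can rank the same fits differently, which is one reason to examine the AIC and BIC versions side by side.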
The results show that different imputation methods produced quite varied MSE model scores and
population estimates. The CART method produced the best MSE model: the dataset imputed by
CART had the best AIC and BIC scores of all the imputation methods. The logistic regression
method used in previous research ranked 6th in both the AIC and BIC versions. Random forest,
on the other hand, produced the worst MSE model of all the methods. These results show that,
when missing data occur in the application of MSE, the choice of imputation method affects the
quality of the MSE output.