COMPARISON OF DATA IMPUTATION METHODS PERFORMANCE FOR MULTIPLE SYSTEM ESTIMATION 
(CASE STUDY: HUMAN TRAFFICKING DATA IN THE NETHERLANDS 2016 - 2019)

Nikolas Anova, Nikolas

Publication date

2023

Summary

Human trafficking persists in the modern world, and the number of victims needs to be monitored. Because human trafficking is a hidden crime, statistics on identified victims reveal only a small part of the problem, and the actual number of victims can only be estimated. UNODC recommends Multiple Systems Estimation (MSE), in which the size of a hidden population of human trafficking victims is estimated by analyzing the overlap between three or more administrative lists on which members of that population appear. One of the main problems in implementing MSE is missing data, which is likely to arise because the registration data come from several different external sources. Imputation should be able to address this problem. Although missing data occur frequently in MSE applications, the literature review found no comparative study of imputation method performance judged by the MSE output. In the Netherlands, the human trafficking records for 2016 – 2019 also contain missing data. Previous studies using the same data applied multiple imputation only with the default method for binary and 2-level categorical data (i.e., logistic regression). Missing data reduce the quality of population estimates, so a suitable imputation method must be chosen beforehand to obtain the best MSE output. This study therefore compares the performance of imputation methods based on the MSE results when estimating the human trafficking population in the Netherlands for 2016 – 2019. The methods are first compared on the AIC and BIC values of the MSE model. The AIC and BIC versions are then compared on model complexity, standard error, and reasonableness of the estimates.
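To make the MSE idea concrete, the following is a minimal sketch in Python, not the thesis's actual model or data. It assumes three hypothetical lists with made-up overlap counts and fits the simplest Poisson log-linear model (lists independent, no interaction terms or covariates) by Newton-Raphson; MSE models used in practice typically add list interactions and covariates. The exponentiated intercept estimates the unobserved cell, i.e. victims on no list. The BIC here takes n as the number of observed individuals, which is one convention among several.

```python
import math

# Hypothetical overlap counts for three lists A, B, C (illustration only,
# NOT the Dutch data). Key = (on A, on B, on C). The (0, 0, 0) cell, people
# on no list, is unobserved and is what MSE estimates.
observed = {
    (1, 0, 0): 60, (0, 1, 0): 50, (0, 0, 1): 40,
    (1, 1, 0): 12, (1, 0, 1): 10, (0, 1, 1): 8,
    (1, 1, 1): 3,
}

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def fit_independence(counts, max_iter=50, tol=1e-10):
    """Fit log(mu) = b0 + bA*a + bB*b + bC*c to the observed cells."""
    cells = sorted(counts)
    X = [[1.0] + [float(v) for v in cell] for cell in cells]
    y = [float(counts[cell]) for cell in cells]
    p = len(X[0])
    beta = [math.log(sum(y) / len(y))] + [0.0] * (p - 1)

    def means(b):
        return [math.exp(sum(x * bj for x, bj in zip(row, b))) for row in X]

    for _ in range(max_iter):
        mu = means(beta)
        # Score vector and Fisher information of the Poisson log-likelihood.
        score = [sum(row[j] * (yi - mi) for row, yi, mi in zip(X, y, mu))
                 for j in range(p)]
        info = [[sum(row[j] * row[k] * mi for row, mi in zip(X, mu))
                 for k in range(p)] for j in range(p)]
        step = solve(info, score)
        beta = [bj + sj for bj, sj in zip(beta, step)]
        if max(abs(sj) for sj in step) < tol:
            break
    mu = means(beta)
    loglik = sum(yi * math.log(mi) - mi - math.lgamma(yi + 1)
                 for yi, mi in zip(y, mu))
    return beta, dict(zip(cells, mu)), loglik

beta, fitted, loglik = fit_independence(observed)
k = len(beta)                          # number of model parameters
n_observed = sum(observed.values())    # individuals seen on at least one list
hidden = math.exp(beta[0])             # estimated count in the (0, 0, 0) cell
aic = 2 * k - 2 * loglik
bic = k * math.log(n_observed) - 2 * loglik  # assumes n = observed individuals

print(f"estimated hidden: {hidden:.1f}, total: {n_observed + hidden:.1f}")
print(f"AIC: {aic:.2f}, BIC: {bic:.2f}")
```

Lower AIC or BIC indicates a better trade-off between fit and model complexity, which is the criterion the summary uses to rank the imputed datasets.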
This study focuses on multiple imputation with seven methods: predictive mean matching (PMM), classification and regression trees (CART), random forest, logistic regression, logistic regression with bootstrap, lasso logistic regression, and linear discriminant analysis (LDA). The different imputation methods produced noticeably different MSE model scores and population estimates. CART produced the best MSE model: the dataset imputed by CART had the best AIC and BIC scores of all the methods. Logistic regression, the method used in previous research, ranked sixth in both the AIC and BIC versions, and random forest produced the worst MSE model. These results show that, when missing data arise in an MSE application, the choice of imputation method affects the quality of the MSE output.
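Multiple imputation yields several completed datasets, each giving its own population estimate and standard error. The summary does not state how these were combined, so the sketch below assumes the standard approach, Rubin's rules, and uses made-up numbers. The function name `pool_rubin` and all values are hypothetical.

```python
import math
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Pool per-imputation estimates and squared standard errors."""
    m = len(estimates)
    pooled = mean(estimates)           # pooled point estimate
    within = mean(variances)           # average within-imputation variance
    between = variance(estimates)      # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, math.sqrt(total)

# Hypothetical population estimates and standard errors from five imputed
# datasets (illustration only, not results from the thesis).
estimates = [5200.0, 5350.0, 5100.0, 5280.0, 5420.0]
std_errors = [310.0, 295.0, 330.0, 305.0, 300.0]

pooled_estimate, pooled_se = pool_rubin(estimates, [se ** 2 for se in std_errors])
print(f"pooled estimate: {pooled_estimate:.0f}, pooled SE: {pooled_se:.0f}")
```

The pooled standard error is never smaller than the average within-imputation error, since the between-imputation term adds the uncertainty caused by the missing data.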

URI

https://studenttheses.uu.nl/handle/20.500.12932/44310

Collections

Theses