Text Classification of Dutch Police Records
Summary
The large databases held by government agencies are a valuable source of data for analysis. With government data, tax evaders might be identified, and people could be invited for medical testing more or less often depending on their records. This thesis project applies text mining to the classification of police reports. The goal was to train a text-mining model that correctly classifies three classes of online crime: online threat, online distribution of sexually obscene imagery, and computer trespass. To this end, different approaches to building a model were compared. Preprocessing steps and model options included linguistic preprocessing, multiple methods of feature construction and selection, boosting, and resampling. Four different algorithms were compared: Naive Bayes, Random Forest, SVM, and XGBoost. The resulting models are promising: their F-scores are 4 to 11 times higher than those of random classification.
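
To make the reported measure concrete, the following is a minimal sketch of how an F-score and its improvement factor over a random baseline could be computed for one class. It is not the evaluation code used in this project. The labels are invented toy data, the function names `f1_score` and `random_baseline_f1` are hypothetical, and the baseline is assumed to be a classifier that guesses the class at random with probability equal to the class prevalence; the summary does not specify how the random baseline was defined.

```python
from collections import Counter


def f1_score(y_true, y_pred, positive):
    """F1 for one class: 2*TP / (2*TP + FP + FN); 0.0 if the denominator is 0."""
    counts = Counter()
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            counts["tp"] += 1
        elif pred == positive:
            counts["fp"] += 1
        elif truth == positive:
            counts["fn"] += 1
    denom = 2 * counts["tp"] + counts["fp"] + counts["fn"]
    return 2 * counts["tp"] / denom if denom else 0.0


def random_baseline_f1(y_true, positive):
    """Approximate F1 of a random guesser (assumption, see lead-in).

    If the class is guessed with probability equal to its prevalence p,
    expected precision and expected recall are both p, so the F1 computed
    from those expected values is p as well.
    """
    return sum(1 for t in y_true if t == positive) / len(y_true)


# Invented toy labels: 2 "threat" reports out of 10.
y_true = ["threat"] * 2 + ["other"] * 8
# Hypothetical model output: both threats found, plus one false alarm.
y_pred = ["threat"] * 3 + ["other"] * 7

model_f1 = f1_score(y_true, y_pred, "threat")
baseline_f1 = random_baseline_f1(y_true, "threat")
factor = model_f1 / baseline_f1

print(f"model F1 = {model_f1:.2f}, baseline F1 = {baseline_f1:.2f}, "
      f"factor = {factor:.1f}")
```

Under this assumed baseline, a rare class has a low baseline F-score, so the same absolute F-score corresponds to a larger improvement factor for a rarer class.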