dc.rights.license | CC-BY-NC-ND | |
dc.contributor.advisor | Bloothooft, G. | |
dc.contributor.advisor | Schraagen, M. | |
dc.contributor.author | Kemenade, J. van | |
dc.date.accessioned | 2016-08-31T17:01:09Z | |
dc.date.available | 2016-08-31T17:01:09Z | |
dc.date.issued | 2016 | |
dc.identifier.uri | https://studenttheses.uu.nl/handle/20.500.12932/23952 | |
dc.description.abstract | One of the main problems in record linkage is variation in names. One approach to dealing with this variation is to remove it by converting each name in the historical records to a base form. This study presents a model that converts Dutch first names to their base form. To build the model, a subset of a dataset containing 132,140 first names and their base forms is used to train three multiclass classifiers: k-nearest neighbours, boosted decision trees and support vector machines. The classifiers are compared on accuracy, training time and classification speed. The best-performing classifier, a boosted decision tree, is then trained and tested on the entire dataset. The final model is a boosted decision tree with a learning rate of 1.0 and 200 decision trees with a maximum depth of 17 levels. The validation accuracy of the model, using 10-fold cross-validation, is 84.56%. The accuracy of the final model on the test set, containing 24,576 names and 447 base forms, is 85.04%, with a classification speed of more than 300 samples per second. | |
dc.description.sponsorship | Utrecht University | |
dc.format.extent | 799724 | |
dc.format.mimetype | application/pdf | |
dc.language.iso | en | |
dc.title | Training a name-variant model using historical data | |
dc.type.content | Bachelor Thesis | |
dc.rights.accessrights | Open Access | |
dc.subject.keywords | name variants, record linkage, multiclass classification | |
dc.subject.courseuu | Kunstmatige Intelligentie | |