Training a name-variant model using historical data

Kemenade, J. van

dc.rights.license	CC-BY-NC-ND
dc.contributor.advisor	Bloothooft, G.
dc.contributor.advisor	Schraagen, M.
dc.contributor.author	Kemenade, J. van
dc.date.accessioned	2016-08-31T17:01:09Z
dc.date.available	2016-08-31T17:01:09Z
dc.date.issued	2016
dc.identifier.uri	https://studenttheses.uu.nl/handle/20.500.12932/23952
dc.description.abstract	One of the main problems in record linkage is variation in names. One approach to dealing with this variation is to remove it by converting each name in the historical records to a base form. This study presents a model that converts Dutch first names to their base form. To build the model, a subset of a dataset containing 132,140 first names and their base forms is used to train three multiclass classifiers: k-Nearest Neighbours, Boosted Decision Trees and Support Vector Machines. The classifiers are compared on accuracy, training time and classification speed. The best-performing classifier, a boosted decision tree, is then trained and tested on the entire dataset. The final model is a boosted decision tree with a learning rate of 1.0 and 200 decision trees, each with a maximum depth of 17 levels. The validation accuracy of the model, using 10-fold cross-validation, is 84.56%. The accuracy of the final model on the test set, containing 24,576 names and 447 base forms, is 85.04%, with a classification speed of more than 300 samples per second.
dc.description.sponsorship	Utrecht University
dc.format.extent	799724
dc.format.mimetype	application/pdf
dc.language.iso	en
dc.title	Training a name-variant model using historical data
dc.type.content	Bachelor Thesis
dc.rights.accessrights	Open Access
dc.subject.keywords	name variants, record linkage, multiclass classification
dc.subject.courseuu	Kunstmatige Intelligentie
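
The abstract describes mapping name variants to a base form with multiclass classifiers evaluated by 10-fold cross-validation. The following is a minimal sketch of that idea for the k Nearest Neighbours candidate only, not the thesis's implementation. The abstract does not say which features or distance measure were used, so the edit-distance metric, the value k=3, the fold assignment, and the names in `TOY` are all assumptions made for illustration.

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def knn_predict(name, train, k=3):
    """Majority vote over the k training names closest to `name`."""
    nearest = sorted(train, key=lambda pair: edit_distance(name, pair[0]))[:k]
    votes = Counter(base for _, base in nearest)
    return votes.most_common(1)[0][0]


def cross_val_accuracy(data, n_folds=10, k=3):
    """Mean accuracy over folds; fold f holds out items f, f + n_folds, ..."""
    scores = []
    for fold in range(n_folds):
        test = data[fold::n_folds]
        train = [p for i, p in enumerate(data) if i % n_folds != fold]
        if not test:
            continue  # skip empty folds on very small datasets
        hits = sum(knn_predict(name, train, k) == base for name, base in test)
        scores.append(hits / len(test))
    return sum(scores) / len(scores)


# Hypothetical (variant, base form) pairs; NOT taken from the thesis dataset.
TOY = [
    ("johannes", "johannes"), ("joannes", "johannes"),
    ("johannis", "johannes"), ("johanes", "johannes"),
    ("cornelis", "cornelis"), ("kornelis", "cornelis"),
    ("cornelius", "cornelis"), ("cornelus", "cornelis"),
    ("willem", "willem"), ("wilhelm", "willem"),
    ("wilem", "willem"), ("willim", "willem"),
]

if __name__ == "__main__":
    print(knn_predict("johannus", TOY))
    print(cross_val_accuracy(TOY, n_folds=4))
```

A boosted decision tree, the classifier the abstract reports as best, would replace `knn_predict` with a trained ensemble. The cross-validation loop would stay the same.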

Files in this item

Name: thesis.pdf
Size: 780.9 KB
Format: PDF

This item appears in the following Collection(s)

Theses
