Information-theoretic anomaly detection and authorship attribution in literature

Vargas Quiros, J.D.

dc.rights.license	CC-BY-NC-ND
dc.contributor.advisor	Siebes, A.P.J.M.
dc.contributor.advisor	Feelders, A.J.
dc.contributor.author	Vargas Quiros, J.D.
dc.date.accessioned	2018-01-29T18:01:38Z
dc.date.available	2018-01-29T18:01:38Z
dc.date.issued	2017
dc.identifier.uri	https://studenttheses.uu.nl/handle/20.500.12932/28488
dc.description.abstract	KRIMP is an algorithm based on information theory that can capture co-occurrence relations of arbitrary length between itemsets in a database. Cross-compression sizes obtained from KRIMP code tables generalize cross-entropy by taking such co-occurrence relations into account. This work investigates the application of KRIMP cross-compression to text data, in anomaly detection and authorship attribution. The question of whether KRIMP can capture author-specific grammatical structure and stylistic choices that are relevant for attribution is answered by comparing KRIMP to a Naive Bayes classifier, which minimizes cross-entropy and can thus be regarded as a special case of the KRIMP classifier. Experiments on English novels indicated that, when the full alphabet is considered and itemsets are created per sentence, compression of punctuation and word co-occurrences at the sentence level is relevant for the attribution task. KRIMP was more accurate in most of the experiments, was more robust to differences in the size and structure of the corpora and to different ways of applying smoothing, and also achieved the highest overall accuracies. Experiments using only function words indicated that, for a large enough training set, their power is limited in comparison with complete alphabets, and that KRIMP offers no advantage in this case.
dc.description.sponsorship	Utrecht University
dc.format.extent	1635839
dc.format.mimetype	application/pdf
dc.language.iso	en
dc.title	Information-theoretic anomaly detection and authorship attribution in literature
dc.type.content	Master Thesis
dc.rights.accessrights	Open Access
dc.subject.keywords	machine learning, information theory, classification, anomaly detection, authorship attribution, krimp, naive bayes, cross entropy, stylometry, mdl, minimum description length, wilhelmus
dc.subject.courseuu	Computing Science
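
The abstract describes Naive Bayes as a classifier that minimizes cross-entropy. The following is a minimal sketch of that baseline only, not of KRIMP and not of the thesis code: it assumes unigram token models with Laplace smoothing and a uniform prior over authors, and the names `train`, `cross_entropy`, `attribute`, and the toy corpora are all hypothetical.

```python
import math
from collections import Counter


def train(tokens, vocab, alpha=1.0):
    """Laplace-smoothed unigram log2-probabilities for one author (assumed model)."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: math.log2((counts[w] + alpha) / total) for w in vocab}


def cross_entropy(tokens, model):
    """Average code length, in bits per token, of `tokens` under `model`."""
    known = [t for t in tokens if t in model]  # skip out-of-vocabulary tokens
    return -sum(model[t] for t in known) / len(known)


def attribute(tokens, models):
    """Pick the author whose model gives the lowest cross-entropy."""
    return min(models, key=lambda author: cross_entropy(tokens, models[author]))


# Toy corpora, invented for illustration; punctuation is kept as tokens.
corpora = {
    "author_a": "the sea , the sea ; the ship and the sea".split(),
    "author_b": "a city . a street . a city and a street".split(),
}
vocab = {t for toks in corpora.values() for t in toks}
models = {author: train(toks, vocab) for author, toks in corpora.items()}

unknown = "the ship and the sea".split()
predicted = attribute(unknown, models)
print(predicted)
```

With a uniform prior and a shared vocabulary, choosing the lowest cross-entropy is the same decision as choosing the highest Naive Bayes log-likelihood, because the two differ only by a factor of the document length. Each token is scored independently here, so co-occurrences within a sentence are ignored; that is the aspect the abstract says KRIMP code tables add.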

Files in this item

Name: thesis.pdf
Size: 1.560Mb
Format: PDF

This item appears in the following Collection(s)

Theses