
dc.rights.license: CC-BY-NC-ND
dc.contributor.advisor: Siebes, A.P.J.M.
dc.contributor.advisor: Feelders, A.J.
dc.contributor.author: Vargas Quiros, J.D.
dc.date.accessioned: 2018-01-29T18:01:38Z
dc.date.available: 2018-01-29T18:01:38Z
dc.date.issued: 2017
dc.identifier.uri: https://studenttheses.uu.nl/handle/20.500.12932/28488
dc.description.abstract: KRIMP is an algorithm based on information theory capable of capturing arbitrary length co-occurrence relations between itemsets in a database. Cross-compression sizes obtained from KRIMP code tables are a generalization of cross-entropy capable of taking into account such co-occurrence relations. This work investigates the application of KRIMP cross-compression to text data, in anomaly detection and authorship attribution. The question of whether KRIMP can capture grammatical structure and stylistic choices specific to an author that are relevant for attribution is answered by comparing KRIMP to a Naive Bayes classifier, which minimizes cross-entropy and can thus be regarded as a special case of the KRIMP classifier. Experiments on English novels indicated that, when the full alphabet is considered and itemsets are created per sentence, compression of punctuation and word co-occurrences at the sentence level is relevant for the attribution task. KRIMP was more accurate in most of the experiments, showed greater robustness to differences in the size and structure of the corpora and different ways of applying smoothing and also achieved the highest overall accuracies. Experiments using only function words indicated that their power is limited in comparison with using complete alphabets for a large enough training set, and there is no advantage to the use of KRIMP in this case.
dc.description.sponsorship: Utrecht University
dc.format.extent: 1635839
dc.format.mimetype: application/pdf
dc.language.iso: en
dc.title: Information-theoretic anomaly detection and authorship attribution in literature
dc.type.content: Master Thesis
dc.rights.accessrights: Open Access
dc.subject.keywords: machine learning, information theory, classification, anomaly detection, authorship attribution, krimp, naive bayes, cross entropy, stylometry, mdl, minimum description length, wilhelmus
dc.subject.courseuu: Computing Science

