        Information-theoretic anomaly detection and authorship attribution in literature

        File
        thesis.pdf (1.560Mb)
        Publication date
        2017
        Author
        Vargas Quiros, J.D.
        Summary
        KRIMP is an algorithm based on information theory that can capture co-occurrence relations of arbitrary length between items in a database. Cross-compression sizes obtained from KRIMP code tables generalize cross-entropy by taking such co-occurrence relations into account. This work investigates the application of KRIMP cross-compression to text data, for anomaly detection and authorship attribution. To answer whether KRIMP can capture the grammatical structure and author-specific stylistic choices that are relevant for attribution, it is compared to a Naive Bayes classifier, which minimizes cross-entropy and can thus be regarded as a special case of the KRIMP classifier. Experiments on English novels indicated that, when the full alphabet is considered and itemsets are created per sentence, compression of punctuation and word co-occurrences at the sentence level is relevant for the attribution task. KRIMP was more accurate in most of the experiments, was more robust to differences in the size and structure of the corpora and to different ways of applying smoothing, and achieved the highest overall accuracies. Experiments using only function words indicated that, given a large enough training set, function words are less powerful than complete alphabets, and that KRIMP offers no advantage in this case.
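
        The cross-entropy baseline mentioned in the summary can be illustrated with a minimal sketch. It is not KRIMP and not the thesis's implementation: it only shows attribution by shortest code length under a Laplace-smoothed unigram model, which is the kind of cross-entropy criterion a Naive Bayes classifier minimizes. The function names (`train`, `code_length`, `attribute`), the smoothing constant `alpha`, and the toy sentences are assumptions made for illustration.

```python
import math
from collections import Counter


def train(tokens):
    """Count token frequencies in one author's training text."""
    return Counter(tokens)


def code_length(tokens, counts, vocab_size, alpha=1.0):
    """Bits needed to encode `tokens` under a Laplace-smoothed unigram model.

    Dividing the result by len(tokens) gives the per-token cross-entropy.
    `alpha` is an assumed smoothing constant, not a value from the thesis.
    """
    total = sum(counts.values())
    bits = 0.0
    for tok in tokens:
        p = (counts[tok] + alpha) / (total + alpha * vocab_size)
        bits += -math.log2(p)
    return bits


def attribute(tokens, models):
    """Return the author whose model encodes `tokens` in the fewest bits."""
    vocab = set(tokens)
    for counts in models.values():
        vocab |= set(counts)
    return min(
        models,
        key=lambda author: code_length(tokens, models[author], len(vocab)),
    )


# Toy corpora (hypothetical); punctuation marks are kept as tokens.
models = {
    "A": train("the cat sat , and the cat slept .".split()),
    "B": train("a dog ran ; a dog barked !".split()),
}
print(attribute("the cat ran .".split(), models))
```

        A KRIMP-based classifier would replace the per-token probabilities with code lengths derived from a code table of itemsets, so that tokens co-occurring within a sentence can be encoded together; the decision rule (choose the author with the smallest encoded size) stays the same.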
        URI
        https://studenttheses.uu.nl/handle/20.500.12932/28488
        Collections
        • Theses