Information-theoretic anomaly detection and authorship attribution in literature
Summary
KRIMP is an algorithm based on information theory capable of capturing arbitrary-length co-occurrence relations between itemsets in a database. Cross-compression sizes obtained from KRIMP code tables generalize cross-entropy by taking such co-occurrence relations into account. This work investigates the application of KRIMP cross-compression to text data for anomaly detection and authorship attribution. To determine whether KRIMP can capture author-specific grammatical structure and stylistic choices that are relevant for attribution, KRIMP is compared with a Naive Bayes classifier, which minimizes cross-entropy and can thus be regarded as a special case of the KRIMP classifier. Experiments on English novels indicated that, when the full alphabet is considered and itemsets are created per sentence, compression of punctuation and word co-occurrences at the sentence level is relevant for the attribution task. KRIMP was more accurate in most of the experiments, showed greater robustness to differences in the size and structure of the corpora and to different ways of applying smoothing, and achieved the highest overall accuracies. Experiments using only function words indicated that, given a large enough training set, their power is limited in comparison with complete alphabets, and that KRIMP offers no advantage in this case.
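The relationship between Naive Bayes and compression-based attribution can be illustrated with a minimal sketch. It is not KRIMP itself and not the implementation used in this work: it covers only the special case in which each author's code table contains single tokens, so that the encoded size of a text equals its cross-entropy under a smoothed unigram model. The function names, the toy corpora, the additive smoothing parameter `alpha`, and the choice to skip tokens outside the shared vocabulary are all assumptions made for illustration.

```python
import math
from collections import Counter


def train_code_lengths(tokens, vocabulary, alpha=1.0):
    """Return a code length in bits for every vocabulary token.

    Assumption: additive (Laplace) smoothing with parameter alpha.
    The code length of a token is -log2 of its smoothed probability.
    """
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocabulary)
    return {w: -math.log2((counts[w] + alpha) / total) for w in vocabulary}


def encoded_size(tokens, code_lengths):
    """Total bits needed to encode tokens with the given code lengths.

    Assumption: tokens outside the vocabulary are skipped.
    """
    return sum(code_lengths[w] for w in tokens if w in code_lengths)


def attribute(tokens, models):
    """Pick the author whose code lengths give the smallest encoded size."""
    return min(models, key=lambda author: encoded_size(tokens, models[author]))


# Hypothetical toy corpora; punctuation marks are kept as tokens.
corpora = {
    "author_a": "the sea , the sea ; the grey sea .".split(),
    "author_b": "a road and a road and a long road .".split(),
}
vocabulary = set().union(*corpora.values())
models = {a: train_code_lengths(t, vocabulary) for a, t in corpora.items()}

predicted = attribute("the grey sea ;".split(), models)
print(predicted)
```

With equal priors for all authors, choosing the smallest encoded size in this sketch is the same decision a multinomial Naive Bayes classifier would make. A KRIMP code table would additionally contain multi-token itemsets, which lets frequently co-occurring tokens in a sentence be encoded together with a shorter code.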