Explainability of Transformers for Authorship Attribution
Summary
Authorship attribution attempts to establish the author of a particular text. In this work, we examine the capabilities of transformer-based models in the subtype of attribution task referred to as authorship verification, which involves determining whether the texts are created by the same author. A few works have been suggested that applied fine-tuned Transformer models in this field. Such approach is motivated by their excellent performance and adaptability (fine-tuning can be performed on texts of different sizes and genres, and different pre-trained model checkpoints enable switching between languages). However, they are not as transparent as the traditional methods, in which features that quantify the style (stylometric features) are selected to maximize the distance between texts. To tackle this problem, we first implement a model for authorship verification based on BERT architecture and then investigate the way its predictions are made by applying an adapted LIME explainer and proposing an attention-based relevant feature extracting procedure. We then compare the two approaches and analyze their explainability from the causal perspective by input ablation and alteration to verify that they can retrieve the features that have a strong influence on the model predictions. We also describe and classify the extracted features from a linguistic perspective.