Machine-annotated rationales: faithfully explaining machine learning models for text classification
Artificial intelligence is not always interpretable to humans at first sight; machine learning models with hidden states or high complexity in particular remain difficult to understand. Explanations for such models can be found, but are not always faithful, i.e. consistent with the reasoning actually performed inside the model. One way of explaining model outputs is to find the parts of the model input that carry the signal for a classification. In natural language, such explanations are called rationales, and whoever marks a part of the text as an explanation (rationale) is called the annotator. Texts form decomposable sets of interpretable features, where selections of (sub-)sentences can serve as explanations for model predictions. To explain machine learning model predictions in text classification, this study introduces machine-annotated rationales: natural language explanations, drawn from the input text, for a model's prediction. Four approaches to finding faithful machine-annotated rationales are proposed. They are evaluated by measuring faithfulness, by set similarity to human-annotated rationales, and through a user evaluation. Results show that faithful machine-annotated rationales can be found for the investigated machine learning models, but that there is a trade-off between faithfulness and end-user interpretability.
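The idea of selecting input parts that carry the classification signal can be sketched with a simple occlusion test: remove each token in turn and keep the tokens whose removal most reduces the predicted class probability. This is an illustrative sketch only, not one of the four approaches proposed in the thesis; the keyword-weight classifier (`predict_positive`) and the function names are placeholders invented for this example.

```python
import math

def predict_positive(tokens):
    """Toy sentiment classifier: sigmoid over hand-picked keyword weights.
    Stands in for any text-classification model (hypothetical, for illustration)."""
    weights = {"great": 2.0, "excellent": 2.5, "boring": -2.0, "bad": -2.5}
    score = sum(weights.get(t, 0.0) for t in tokens)
    return 1.0 / (1.0 + math.exp(-score))

def occlusion_rationale(tokens, k=2):
    """Machine-annotate a rationale: the k tokens whose occlusion
    (removal) causes the largest drop in the predicted probability."""
    base = predict_positive(tokens)
    drops = []
    for i, tok in enumerate(tokens):
        occluded = tokens[:i] + tokens[i + 1:]
        drops.append((base - predict_positive(occluded), tok))
    # Sort by drop, largest first; keep only tokens that actually
    # supported the prediction (positive drop).
    drops.sort(reverse=True)
    return [tok for drop, tok in drops[:k] if drop > 0]

text = "the plot was great and the acting excellent".split()
print(occlusion_rationale(text))  # → ['excellent', 'great']
```

Because the rationale is computed directly from the model's own probability changes, it is faithful to this toy model by construction; for complex models, measuring such faithfulness is exactly the evaluation problem the abstract describes.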