How does CLIP process negation? A multimodal interpretability study
Summary
Various benchmarks have measured the linguistic capabilities of vision-and-language (VL) models, but they do not provide insight into how models implement these capabilities. This thesis translates model interpretability techniques developed for large language models to the multimodal space in order to investigate the mechanisms involved in CLIP's processing of negation. In the text encoder, specific negator-selective attention heads are found that seem crucial in controlling the flow of negation-related information through the model. Early evidence suggests that these heads are dataset-independent. In the image encoder, MLPs seem more relevant than attention, particularly in early layers, but further research is needed to elucidate these processes. Regarding CLIP's imperfect ability to process negation, multiple dataset features are identified that partly explain its performance, suggesting that benchmark scores are not a direct indicator of linguistic understanding. Future research directions are discussed that would refine our understanding of the discovered mechanisms and test their generalisability across other datasets and models.
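To make the text-encoder analysis concrete, the sketch below shows one way such a screen for negator-selective attention heads could look: feed a negated caption through CLIP's text encoder, read out how much attention each head directs at the negator token, and rank heads by that score. This is an illustrative probe under stated assumptions, not the thesis's actual pipeline; the checkpoint name, example caption, and mean-attention heuristic are all assumptions for the sake of the example.

```python
# Illustrative sketch: screen CLIP text-encoder heads for attention to a
# negator token. Checkpoint, caption, and scoring heuristic are assumptions,
# not the thesis's method.
import torch
from transformers import CLIPTokenizer, CLIPTextModel

name = "openai/clip-vit-base-patch32"  # assumed checkpoint
tokenizer = CLIPTokenizer.from_pretrained(name)
# eager attention so output_attentions returns per-head weights
# (older transformers versions may not need this kwarg)
model = CLIPTextModel.from_pretrained(name, attn_implementation="eager").eval()

caption = "a photo with no dog"  # illustrative negated caption
inputs = tokenizer(caption, return_tensors="pt")
tokens = tokenizer.convert_ids_to_tokens(inputs.input_ids[0])
neg_pos = tokens.index("no</w>")  # position of the negator token

with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# out.attentions: one (batch, heads, seq, seq) tensor per layer;
# column neg_pos holds the attention each query token pays to the negator.
scores = []
for layer, attn in enumerate(out.attentions):
    to_neg = attn[0, :, :, neg_pos].mean(dim=-1)  # mean over query positions
    for head, s in enumerate(to_neg):
        scores.append((s.item(), layer, head))

# Heads attending most strongly to the negator are candidates for
# deeper causal analysis (e.g. ablation or activation patching).
for s, layer, head in sorted(scores, reverse=True)[:5]:
    print(f"layer {layer} head {head}: mean attention to negator = {s:.3f}")
```

A screen like this only surfaces correlational candidates; establishing that a head actually controls the flow of negation-related information would require causal interventions of the kind the thesis investigates.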