Do more 'humanlike' vision-language models perform better on grounding challenges? An attribution-based study on the VALSE image-caption alignment benchmark
Summary
Vision-language models (VLMs) are increasingly successful, but questions remain about the extent and nature of their grounding in the visual modality. Prior approaches to this question tend to focus on either performance-based measures of grounding (what can a model do?) or comparisons between a model's internal representations and a normative human baseline (is a model doing things in a humanlike way?). This study tests whether the results of these two approaches correlate with one another on a benchmark specifically designed to measure grounding. I design a human experiment to collect human saliency maps for a subset of the VALSE grounding benchmark, and I generate attribution maps for four VLMs on the same stimuli. From these, I construct a "humanlikeness" similarity metric for visual model attribution maps and find that model attribution maps are detectably "humanlike" on average. However, the degree of attribution humanlikeness does not correlate with model performance on the VALSE benchmark, either between or within models. The utility of this attribution-based humanlikeness metric as a complement to performance-based benchmarks therefore remains unclear.
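As a purely illustrative sketch, and not the metric actually used in this study, one simple way to score the "humanlikeness" of a single attribution map against a human saliency map is a rank correlation over pixel importances. The function name, the assumption that both maps share the same spatial resolution, and the example arrays are hypothetical.

```python
import numpy as np
from scipy.stats import spearmanr


def humanlikeness(model_attr: np.ndarray, human_saliency: np.ndarray) -> float:
    """Rank-correlate a model attribution map with a human saliency map.

    Both maps are assumed to have the same spatial resolution; higher
    values indicate a more humanlike allocation of visual importance.
    """
    rho, _ = spearmanr(model_attr.ravel(), human_saliency.ravel())
    return float(rho)


# Hypothetical usage on one stimulus (random maps stand in for real data):
model_map = np.random.rand(32, 32)
human_map = np.random.rand(32, 32)
print(humanlikeness(model_map, human_map))
```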