Using Natural Language Inference to Perform Visual Inference: the Case of Quantified Noun Phrases
Summary
Evaluating quantities in visual data remains one of the biggest challenges in Visual Inference. We explore a novel approach to reasoning about quantities in visual contexts that uses the tools of Natural Language Inference and works with textual descriptions of visual scenes. Given a complete description of a simple geometrical scene, we predict whether a quantified statement about the objects in that scene follows from the description. We test an LSTM-based neural network architecture on this task and examine the model's generalization ability.
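To make the task concrete, the following is a minimal sketch of how a premise (scene description) and the gold label for a quantified hypothesis could be produced. It is illustrative only: the object attributes, the sentence template, the quantifier inventory ("all", "some", "no", "most"), and the function names `describe` and `follows` are assumptions introduced here, not the dataset or method used in this work.

```python
from typing import List, Tuple

# Hypothetical scene representation: each object is a (color, shape) pair.
Obj = Tuple[str, str]


def describe(scene: List[Obj]) -> str:
    """Render a complete textual description of the scene (the premise)."""
    return " ".join(f"There is a {color} {shape}." for color, shape in scene)


def follows(scene: List[Obj], quantifier: str, shape: str, color: str) -> bool:
    """Decide whether '<quantifier> <shape>s are <color>' holds in the scene.

    Assumed semantics: 'all' is vacuously true when no object has the
    given shape, and 'most' means strictly more than half.
    """
    restrictor = [obj for obj in scene if obj[1] == shape]
    matching = [obj for obj in restrictor if obj[0] == color]
    if quantifier == "all":
        return len(matching) == len(restrictor)
    if quantifier == "some":
        return len(matching) > 0
    if quantifier == "no":
        return len(matching) == 0
    if quantifier == "most":
        return 2 * len(matching) > len(restrictor)
    raise ValueError(f"unsupported quantifier: {quantifier}")


scene = [("red", "square"), ("red", "square"), ("blue", "circle"), ("blue", "square")]
premise = describe(scene)
label_all = follows(scene, "all", "square", "red")    # hypothesis: all squares are red
label_most = follows(scene, "most", "square", "red")  # hypothesis: most squares are red
```

In this sketch, a model for the task would receive `premise` and the hypothesis sentence as text and would be trained to predict the label that `follows` computes symbolically.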