Domain-Specific Visual Representation Learning Using Natural Language Supervision
Summary
The military intelligence domain is one of many fields investigating deep learning methods to automate various
processes, especially for the task of recognizing specific entities in large sets of images. Current state-of-the-art
methods cannot be easily applied in the military domain since they require large sets of labelled images, which
are challenging to acquire for domain-specific classes. Recently, research has investigated the possibility of
learning visual features with natural language supervision by using image captioning as a pre-training task for
visual backbones. This study investigates the possibility of pre-training with domain-specific image-captions to
learn domain-specific visual features. We pre-trained convolutional neural networks from scratch using a military-specific image-caption dataset (Janes Captions) collected for this study. We also evaluated the effect of different image-captioning pre-training tasks on the learned visual features. Although these models did not outperform the current state-of-the-art methods, they outperformed models pre-trained on similar amounts of
generic image-captions. Ultimately, natural language supervision for pre-training visual models is a promising
concept that, if applied correctly, could alleviate the labelled-data requirements of current state-of-the-art methods, especially for applications in specific domains.