Domain-Specific Visual Representation Learning Using Natural Language Supervision
The military intelligence domain is one of many fields investigating deep learning methods to automate various processes, particularly the task of recognizing specific entities in large sets of images. Current state-of-the-art methods cannot be applied easily in the military domain because they require large sets of labelled images, which are difficult to acquire for domain-specific classes. Recent research has investigated learning visual features with natural language supervision, using image captioning as a pre-training task for visual backbones. This study investigates whether pre-training on domain-specific image-caption pairs can yield domain-specific visual features. We pre-train convolutional neural networks from scratch on a military-specific image-caption dataset (Janes Captions) collected for this study, and we evaluate how different image-captioning pre-training tasks affect the learned visual features. Although these models did not outperform current state-of-the-art methods, they outperformed models pre-trained on similar amounts of generic image-caption pairs. Ultimately, natural language supervision for pre-training visual models is a promising concept that, applied correctly, could address the limitations of current state-of-the-art methods, especially in specialized domains.
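The core idea of captioning as a pre-training task can be sketched in a toy form: an image feature vector is mapped to a distribution over caption tokens, and the parameters are trained to maximize the likelihood of the paired caption. The NumPy sketch below is a hypothetical, greatly simplified illustration (a bag-of-words captioning head with random toy data; real systems use a CNN encoder and an autoregressive text decoder), not the study's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions -- illustrative only, not values from the study.
vocab_size, feat_dim, cap_len = 50, 16, 5

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def caption_loss(img_feat, caption, W):
    """Average negative log-likelihood of the caption tokens given the image.

    A stand-in for a captioning head: here every token is predicted from the
    image feature alone (bag-of-words); real decoders also condition on the
    previously generated tokens.
    """
    probs = softmax(img_feat @ W)          # (vocab_size,)
    return -np.log(probs[caption]).mean()

# One toy "image" feature and its paired caption (token ids).
img = rng.normal(size=feat_dim)
cap = rng.integers(0, vocab_size, size=cap_len)
W = rng.normal(scale=0.1, size=(feat_dim, vocab_size))

# One gradient step on W (analytic gradient of softmax cross-entropy):
# d loss / d logits = probs - (empirical token distribution of the caption).
probs = softmax(img @ W)
grad_logits = probs.copy()
for t in cap:
    grad_logits[t] -= 1.0 / cap_len
W_new = W - 0.1 * np.outer(img, grad_logits)

before, after = caption_loss(img, cap, W), caption_loss(img, cap, W_new)
print(f"loss before: {before:.4f}, after one step: {after:.4f}")
```

In the actual study the gradient flows back through a convolutional backbone rather than a single linear map, which is what lets the caption supervision shape the visual features themselves.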